USER-FRIENDLY TAIL BOUNDS 
FOR SUMS OF RANDOM MATRICES 



JOEL A. TROPP 

Abstract. This paper presents new probability inequalities for sums of independent, random, self- 
adjoint matrices. These results place simple and easily verifiable hypotheses on the summands, and 
they deliver strong conclusions about the large-deviation behavior of the maximum eigenvalue of 
the sum. Tail bounds for the norm of a sum of random rectangular matrices follow as an immediate 
corollary. The proof techniques also yield some information about matrix-valued martingales. 

In other words, this paper provides noncommutative generalizations of the classical bounds 
associated with the names Azuma, Bennett, Bernstein, Chernoff, Hoeffding, and McDiarmid. The
matrix inequalities promise the same diversity of application, ease of use, and strength of conclusion 
that have made the scalar inequalities so valuable. 



1. Introduction 

Random matrices have come to play a significant role in computational mathematics. This line 
of research has advanced by using established methods from random matrix theory, but it has also 
generated difficult questions that cannot be addressed without new tools. Let us summarize some 
of the challenges that arise in numerical applications. 

• Research has extended well beyond the classical ensembles (e.g., Wishart matrices and 
Wigner matrices) to encompass many other classes of random matrices. For instance, it is 
now common to study the properties of a sparse matrix sampled from a fixed matrix or a 
random submatrix drawn from a fixed matrix. 

• We also encounter highly structured matrices that involve a limited amount of randomness. 
One important example is the randomized DFT, which consists of a diagonal matrix of 
random signs multiplied by a discrete Fourier transform matrix. 

• Questions about the spectral properties of random matrices remain fundamental, but mod- 
ern problems can also involve other considerations. For example, we might need to estimate 
the cut norm of a random adjacency matrix. Or we might want to study the action of a 
random operator on a class of vectors or matrices. 

• Most problems in numerical mathematics concern matrices of finite order. Asymptotic 
theory is less relevant in practice. 

• We often require explicit large-deviation theorems for statistics of random matrices so that 
we can study rates of convergence. 

• Results with effective constants are essential to ensure that algorithms are provably correct. 

We have encountered these issues in a wide range of problems from computational mathematics:
smoothed analysis of Gaussian elimination [SST06]; semidefinite relaxation and rounding of
quadratic maximization problems [Nem07, So09]; construction of maps for dimensionality reduc- 
tion [AC09]; matrix approximation by sparsification [AM07] and by sampling submatrices [RV07]; 
analysis of sparse approximation [Tro08] and compressive sampling [CR07] algorithms; randomized 
schemes for low-rank matrix factorization [HMT11]; and analysis of algorithms for completion of
low-rank matrices [Gro11, Rec09]. And this list is by no means comprehensive!

In most of these applications, the methods currently invoked to study random matrices require 
a substantial amount of practice to use effectively. Even so, the final results tend to be a little 
disappointing: the constants are usually poor and the predictions are sometimes coarser than we 
might like. These frustrations have led us to search for simpler techniques that still yield detailed 
quantitative information about finite random matrices. 

1.1. Technical Overview. We consider a finite sequence $\{X_k\}$ of random, self-adjoint matrices
with dimension $d$. Our goal is to harness basic properties of these matrices to bound the probability

$$ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \ge t \Big\}. \tag{1.1} $$

Here and elsewhere, $\lambda_{\max}$ denotes the algebraically largest eigenvalue of a self-adjoint matrix. This
formulation is more general than it may appear because we can exploit the same ideas to explore 
several related problems: 

• We can study the smallest eigenvalue of the sum. 

• We can bound the largest singular value of a sum of random rectangular matrices. 

• Related arguments apply to matrix martingales and other adapted sequences. 

Indeed, the expression (1.1) captures the essence of many questions that arise in numerical appli- 
cations of random matrix theory, including most of the research cited above. 

Observe that (1.1) formally resembles the probability that a sum of real random variables exceeds 
a certain level. The Laplace transform method, attributed to Bernstein, is a particularly elegant 
system for producing tail bounds for sums of scalar random variables; see [McD98, Lug09] for 
accessible discussions. In a remarkable paper [AW02], Ahlswede and Winter show how to transport 
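the Laplace transform method to the matrix setting. They establish that

$$ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \ge t \Big\} \le \inf_{\theta > 0} \Big\{ e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} \exp\Big( \sum\nolimits_k \theta X_k \Big) \Big\}. \tag{1.2} $$

In words, the probability (1.1) is controlled by a matrix version of the moment generating function
(mgf). See Proposition 3.1 for an easy proof of (1.2) that is due to Oliveira [Oli10b].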

The matrix Laplace transform estimate (1.2) presents a serious technical challenge. We must 
control the trace of the matrix mgf

$$ \mathbb{E} \operatorname{tr} \exp\Big( \sum\nolimits_k \theta X_k \Big) $$

using information about the summands $X_1, X_2, X_3, \dots$. This estimate requires powerful tools, and
it stands as the major impediment to bounding the tail probability (1.1). 

The true significance of the Ahlswede-Winter argument [AW02, App.] consists in their technique 
for computing the required bounds on the matrix mgf. We describe their method in §3.7. The 
following probability inequality for a matrix Gaussian series is typical of the results that emerge 
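from their approach. Let $\{A_k\}$ be a family of fixed self-adjoint matrices with dimension $d$, and let
$\{\gamma_k\}$ be a sequence of independent standard normal variables. Then

$$ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k \gamma_k A_k \Big) \ge t \Big\} \le d \cdot e^{-t^2/2\sigma_{\mathrm{AW}}^2} \quad\text{where}\quad \sigma_{\mathrm{AW}}^2 := \sum\nolimits_k \lambda_{\max}(A_k^2). \tag{1.3} $$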

The Ahlswede-Winter apparatus leads to a collection of other interesting probability inequalities;
see §1.3 for references. Nevertheless, tail bounds developed in this fashion, including (1.3), are 
usually very far from optimal. See §3.7 and §4.8 for further discussion of this point. 

This paper describes a more satisfactory framework for completing the bound on the matrix mgf. 
The crucial new ingredient in our argument is a deep theorem [Lie73, Thm. 6] of Lieb from his 
seminal paper on convex trace functions. We introduce Lieb's theorem in §3.4, and we explain how 
to combine this result with the matrix Laplace transform technique. We use this scheme to obtain 
a large family of probability inequalities that are essentially sharp in a wide variety of situations. 

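Our approach represents a dramatic advance beyond the Ahlswede-Winter technique. For example,
our method delivers the following bound for a matrix Gaussian series:

$$ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k \gamma_k A_k \Big) \ge t \Big\} \le d \cdot e^{-t^2/2\sigma^2} \quad\text{where}\quad \sigma^2 := \Big\| \sum\nolimits_k A_k^2 \Big\|. \tag{1.4} $$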

The estimate (1.4) offers a fundamental advantage over (1.3) because the variance parameter $\sigma^2$
is often $d$ times smaller than $\sigma_{\mathrm{AW}}^2$. Furthermore, the discussion in §4 demonstrates that the
inequality (1.4) cannot be sharpened without changing its structure. This improvement is typical of
results constructed from our blueprint. 
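
As an informal numerical illustration (not part of the original argument), the following Python sketch compares the two variance parameters, $\sigma^2 = \|\sum_k A_k^2\|$ and $\sigma_{\mathrm{AW}}^2 = \sum_k \lambda_{\max}(A_k^2)$, on a simple family of coefficients. The function name `variance_parameters` and the choice of diagonal matrix units as the example are assumptions made for illustration only.

```python
import numpy as np

def variance_parameters(mats):
    """Return (sigma2, sigma2_aw) for a list of real symmetric or Hermitian matrices A_k.

    sigma2    = || sum_k A_k^2 ||        (spectral norm)
    sigma2_aw = sum_k lambda_max(A_k^2)
    """
    squares = [A @ A for A in mats]
    sigma2 = np.linalg.norm(sum(squares), ord=2)
    # eigvalsh returns eigenvalues in ascending order, so [-1] is the largest.
    sigma2_aw = sum(np.linalg.eigvalsh(S)[-1] for S in squares)
    return float(sigma2), float(sigma2_aw)

d = 5
# Hypothetical example: A_k = e_k e_k^T, the diagonal matrix units.
units = [np.diag(np.eye(d)[k]) for k in range(d)]
sigma2, sigma2_aw = variance_parameters(units)
print(sigma2, sigma2_aw)
```

For this family the squares sum to the identity, so $\sigma^2 = 1$ while $\sigma_{\mathrm{AW}}^2 = d$. This is one instance of the factor-of-$d$ gap described above.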



1.2. Index of Inequalities. This work contains a large number of bounds for the probability (1.1). 
The precise form of each inequality depends on prior information about the summands. As a service 
to the reader, we have collected the most useful results in this section. We have also included a 
short qualitative discussion of each bound, along with the location in the paper where the full 
treatment appears. 

1.2.1. Notation. The symbol $\preccurlyeq$ denotes the semidefinite order on self-adjoint matrices. The maps
$\lambda_{\min}$ and $\lambda_{\max}$ return the algebraically smallest and largest eigenvalue of a self-adjoint matrix. We
write $\|\cdot\|$ for the spectral norm, which equals the largest singular value of a matrix.

1.2.2. Main Results for Positive-Semidefinite Matrices. In classical probability theory, one of the
most famous concentration results concerns the number of successes in a sequence of independent 
random trials. This quantity can be expressed as a sum of independent, bounded random variables. 
Chernoff's large-deviation theorem [Che52] provides explicit estimates on the probability that this
type of series is greater than (or smaller than) a specified level. 

In the matrix setting, the analogous theorem concerns a sum of positive-semidefinite random 
matrices subject to a uniform eigenvalue bound. The matrix Chernoff inequality shows that the 
extreme eigenvalues of the matrix series have the same binomial-type behavior that occurs in the 
scalar case. 

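Theorem 1.1 (Matrix Chernoff). Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint
matrices with dimension $d$. Assume that each random matrix satisfies

$$ X_k \succcurlyeq 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R \quad\text{almost surely.} $$

Define

$$ \mu_{\min} := \lambda_{\min}\Big( \sum\nolimits_k \mathbb{E} X_k \Big) \quad\text{and}\quad \mu_{\max} := \lambda_{\max}\Big( \sum\nolimits_k \mathbb{E} X_k \Big). $$

Then

$$ \mathbb{P}\Big\{ \lambda_{\min}\Big( \sum\nolimits_k X_k \Big) \le (1 - \delta)\mu_{\min} \Big\} \le d \cdot \left[ \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \right]^{\mu_{\min}/R} \quad\text{for } \delta \in [0, 1], \text{ and} $$

$$ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \ge (1 + \delta)\mu_{\max} \Big\} \le d \cdot \left[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu_{\max}/R} \quad\text{for } \delta \ge 0. $$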



Chernoff bounds are well suited to studying the spectrum of a random matrix with independent 
columns. For additional details and related inequalities, turn to §5. 
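
For readers who want to evaluate the Chernoff bounds numerically, here is a minimal Python sketch. It assumes the two bounds take the standard form $d \cdot [e^{-\delta}/(1-\delta)^{1-\delta}]^{\mu_{\min}/R}$ and $d \cdot [e^{\delta}/(1+\delta)^{1+\delta}]^{\mu_{\max}/R}$; the function names are hypothetical, and the code works with logarithms to avoid overflow.

```python
import math

def matrix_chernoff_upper(d, mu_max, R, delta):
    """Bound on P{lambda_max(sum_k X_k) >= (1 + delta) * mu_max}, valid for delta >= 0."""
    if delta < 0:
        raise ValueError("delta must be nonnegative")
    # log of e^delta / (1 + delta)^(1 + delta)
    log_base = delta - (1.0 + delta) * math.log1p(delta)
    return d * math.exp((mu_max / R) * log_base)

def matrix_chernoff_lower(d, mu_min, R, delta):
    """Bound on P{lambda_min(sum_k X_k) <= (1 - delta) * mu_min}, valid for 0 <= delta <= 1."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    if delta == 1.0:
        log_base = -1.0  # convention 0^0 = 1, so the base equals e^{-1}
    else:
        # log of e^{-delta} / (1 - delta)^(1 - delta)
        log_base = -delta - (1.0 - delta) * math.log1p(-delta)
    return d * math.exp((mu_min / R) * log_base)

# Hypothetical parameters: dimension 10, mu_max = mu_min = 20, R = 1.
print(matrix_chernoff_upper(10, 20.0, 1.0, 1.0))
print(matrix_chernoff_lower(10, 20.0, 1.0, 0.5))
```

Both bounds equal the dimension $d$ at $\delta = 0$ and decay as $\delta$ grows, so they are informative only once the exponential factor overcomes the dimensional factor.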

1.2.3. Main Results for Self-Adjoint Matrices. Another basic example of concentration is provided
by a sum of real numbers modulated by independent standard normal variables or, alternatively, by
independent Rademacher¹ random variables. A classical result shows that this type of random series
exhibits subgaussian tails. When we replace the real numbers by self-adjoint random matrices, we 
discover that the maximum and minimum eigenvalue of the matrix sum retain this normal tail 
behavior. 

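Theorem 1.2 (Matrix Gaussian and Rademacher Series). Consider a finite sequence $\{A_k\}$ of fixed,
self-adjoint matrices with dimension $d$, and let $\{\xi_k\}$ be a finite sequence of independent standard
normal or independent Rademacher random variables. Then, for all $t \ge 0$,

$$ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k \xi_k A_k \Big) \ge t \Big\} \le d \cdot e^{-t^2/2\sigma^2} \quad\text{where}\quad \sigma^2 := \Big\| \sum\nolimits_k A_k^2 \Big\|. $$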

Theorem 1.2 was first established explicitly by Oliveira using a different method [Oli10b]. We
have included the result here because it is very important and because it follows from a mechan- 
ical application of our techniques. Turn to §4 for an exhaustive discussion of matrix Gaussian 
series. This presentation also describes several new phenomena that arise when we translate scalar 
inequalities to the matrix setting. 
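
The Rademacher case of Theorem 1.2 can be checked exactly for a small family of coefficients by enumerating every sign pattern. The sketch below is an illustration under stated assumptions: the coefficient matrices are arbitrary seeded random symmetric matrices, and the helper names are hypothetical.

```python
import itertools
import numpy as np

def rademacher_series_tail(mats, t):
    """Exact P{lambda_max(sum_k eps_k A_k) >= t}, by enumerating all sign patterns."""
    hits = 0
    total = 0
    for signs in itertools.product((-1.0, 1.0), repeat=len(mats)):
        S = sum(s * A for s, A in zip(signs, mats))
        hits += int(np.linalg.eigvalsh(S)[-1] >= t)
        total += 1
    return hits / total

def series_bound(mats, t):
    """Right-hand side d * exp(-t^2 / (2 sigma^2)) with sigma^2 = ||sum_k A_k^2||."""
    d = mats[0].shape[0]
    sigma2 = np.linalg.norm(sum(A @ A for A in mats), ord=2)
    return d * np.exp(-t**2 / (2.0 * sigma2))

rng = np.random.default_rng(0)  # arbitrary seed
mats = []
for _ in range(8):
    G = rng.standard_normal((3, 3))
    mats.append((G + G.T) / 2.0)  # symmetrize to obtain a self-adjoint coefficient

for t in (1.0, 3.0, 5.0, 7.0):
    print(t, rademacher_series_tail(mats, t), series_bound(mats, t))
```

With eight coefficients there are only 256 sign patterns, so the left-hand probability is computed without sampling error.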

The Hoeffding inequality is a more general result that describes a sum of independent, zero-mean
random variables that are subject to upper and lower bounds; it demonstrates that this random 
series exhibits normal concentration. We can extend this result to the matrix setting by considering 
random matrices that satisfy semidefinite upper bounds. In the matrix case, the maximum and 
minimum eigenvalues of the sum also have subgaussian behavior. 

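Theorem 1.3 (Matrix Hoeffding). Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint
matrices with dimension $d$, and let $\{A_k\}$ be a sequence of fixed self-adjoint matrices. Assume
that each random matrix satisfies

$$ \mathbb{E} X_k = 0 \quad\text{and}\quad X_k^2 \preccurlyeq A_k^2 \quad\text{almost surely.} $$

Then, for all $t \ge 0$,

$$ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \ge t \Big\} \le d \cdot e^{-t^2/8\sigma^2} \quad\text{where}\quad \sigma^2 := \Big\| \sum\nolimits_k A_k^2 \Big\|. $$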

The constant 1/8 in Theorem 1.3 can be improved when there is additional information available. 
See §7 for a discussion and some related results for martingales. 

In fact, a sum of independent, bounded random variables may vary substantially less than the 
Hoeffding bound suggests. A famous inequality of Bernstein demonstrates that this type of random 
series exhibits normal concentration near its mean on a scale determined by the variance of the sum. 
On the other hand, the tail of the sum decays subexponentially on a scale controlled by a uniform 
upper bound on the summands. Sums of independent random matrices exhibit the same type of 
behavior, where the normal concentration depends on a matrix generalization of the variance and 
the tails are controlled by a uniform bound on the maximum eigenvalue of each summand. 

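Theorem 1.4 (Matrix Bernstein). Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint
matrices with dimension $d$. Assume that each random matrix satisfies

$$ \mathbb{E} X_k = 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R \quad\text{almost surely.} $$

Then, for all $t \ge 0$,

$$ \mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \ge t \Big\} \le d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right) \quad\text{where}\quad \sigma^2 := \Big\| \sum\nolimits_k \mathbb{E}(X_k^2) \Big\|. $$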



¹ A Rademacher random variable is uniformly distributed on {±1}.

Independently, Oliveira has established a somewhat weaker version of Theorem 1.4 using alternative
techniques [Oli10a]. The reader is probably aware that the probability literature contains
a huge number of results that extend Bernstein's inequality to include other a priori information 
on the summands, such as bounds on the rate of moment growth. Section 6 contains additional 
matrix probability inequalities of this species. 
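
To make the two regimes of the Bernstein bound concrete, the following Python sketch evaluates the right-hand side $d \cdot \exp\big( (-t^2/2) / (\sigma^2 + Rt/3) \big)$ and inverts it. The inversion formula is not taken from the paper: it comes from solving the quadratic $t^2/2 = L(\sigma^2 + Rt/3)$ with $L = \log(d/p)$, and both function names are hypothetical.

```python
import math

def matrix_bernstein_bound(d, sigma2, R, t):
    """Evaluate d * exp( (-t^2 / 2) / (sigma2 + R * t / 3) ) for t >= 0."""
    return d * math.exp(-(t * t / 2.0) / (sigma2 + R * t / 3.0))

def matrix_bernstein_level(d, sigma2, R, p):
    """Level t at which the bound equals the target probability p, for 0 < p <= d.

    Solves t^2 - (2 L R / 3) t - 2 L sigma2 = 0 with L = log(d / p).
    """
    L = math.log(d / p)
    a = L * R / 3.0
    return a + math.sqrt(a * a + 2.0 * L * sigma2)

# Hypothetical parameters: d = 100, sigma^2 = 4, R = 1, failure probability 1%.
t = matrix_bernstein_level(100, 4.0, 1.0, 0.01)
print(t, matrix_bernstein_bound(100, 4.0, 1.0, t))
```

The level splits into a subgaussian term of order $\sqrt{\sigma^2 \log(d/p)}$ and a subexponential term of order $R \log(d/p)$, which mirrors the two scales discussed before Theorem 1.4.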

1.2.4. Main Results for Rectangular Matrices. As an immediate corollary of our results for self- 
adjoint random matrices, we can also establish a collection of inequalities for the maximum singular 
value of a sum of random rectangular matrices. In each case, we extend the result to rectangular 
matrices by using a device from operator theory called the self-adjoint dilation (§2.6). Remark 3.11 
and §4.2 offer some discussion of this technique. This section presents two of the most important 
inequalities for sums of random rectangular matrices. 

As in the self-adjoint case, the norm of a Gaussian or Rademacher series with rectangular matrix 
coefficients has subgaussian tails. This result follows directly from Theorem 1.2; see §4.2 for a 
complete proof. Observe that the variance parameter changes to reflect the fact that the row and 
column spaces of a general matrix are independent from each other; the variance can be viewed as 
a noncommutative "sum of squares." 

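Theorem 1.5 (Matrix Gaussian and Rademacher Series: Rectangular Case). Consider a finite
sequence $\{B_k\}$ of fixed matrices with dimension $d_1 \times d_2$, and let $\{\xi_k\}$ be a finite sequence of
independent standard normal or independent Rademacher random variables. Define the variance
parameter

$$ \sigma^2 := \max\Big\{ \Big\| \sum\nolimits_k B_k B_k^* \Big\|, \ \Big\| \sum\nolimits_k B_k^* B_k \Big\| \Big\}. $$

Then, for all $t \ge 0$,

$$ \mathbb{P}\Big\{ \Big\| \sum\nolimits_k \xi_k B_k \Big\| \ge t \Big\} \le (d_1 + d_2) \cdot e^{-t^2/2\sigma^2}. $$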

We can also develop a rectangular version of the matrix Bernstein inequality. Notice the parallel 
between the variance parameter here and the variance parameter for a rectangular Gaussian series. 
This result is an immediate corollary of Theorem 1.4; a proof sketch appears in Remark 6.3. 

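Theorem 1.6 (Matrix Bernstein: Rectangular Case). Consider a finite sequence $\{Z_k\}$ of independent,
random matrices with dimensions $d_1 \times d_2$. Assume that each random matrix satisfies

$$ \mathbb{E} Z_k = 0 \quad\text{and}\quad \|Z_k\| \le R \quad\text{almost surely.} $$

Define

$$ \sigma^2 := \max\Big\{ \Big\| \sum\nolimits_k \mathbb{E}(Z_k Z_k^*) \Big\|, \ \Big\| \sum\nolimits_k \mathbb{E}(Z_k^* Z_k) \Big\| \Big\}. $$

Then, for all $t \ge 0$,

$$ \mathbb{P}\Big\{ \Big\| \sum\nolimits_k Z_k \Big\| \ge t \Big\} \le (d_1 + d_2) \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right). $$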

We trust that the reader can develop other probability inequalities for rectangular matrices as 
needed. For brevity, we have omitted further examples. 
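
The rectangular variance parameter is easy to compute directly. The sketch below assumes the definition $\sigma^2 = \max\{ \|\sum_k B_k B_k^*\|, \|\sum_k B_k^* B_k\| \}$ for fixed coefficients; the function name and the example family (matrix units in the first row of a $2 \times 3$ array) are chosen only for illustration.

```python
import numpy as np

def rectangular_variance(mats):
    """sigma^2 = max{ ||sum_k B_k B_k^*||, ||sum_k B_k^* B_k|| } for fixed d1 x d2 matrices."""
    row_part = sum(B @ B.conj().T for B in mats)   # d1 x d1
    col_part = sum(B.conj().T @ B for B in mats)   # d2 x d2
    return float(max(np.linalg.norm(row_part, ord=2),
                     np.linalg.norm(col_part, ord=2)))

# Hypothetical example: B_k has a single unit entry at position (1, k), k = 1, 2, 3.
d1, d2 = 2, 3
units = []
for k in range(d2):
    B = np.zeros((d1, d2))
    B[0, k] = 1.0
    units.append(B)

print(rectangular_variance(units))
```

Here $\sum_k B_k B_k^*$ has norm 3 while $\sum_k B_k^* B_k$ is the identity with norm 1, so the maximum is attained by the first term. Neither term alone determines the variance parameter in general.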

1.2.5. Inequalities for Matrix Martingales. The techniques in this paper also lead directly to some 
simple results for matrix martingales. This material appears in §7. 

Azuma Inequality: The Azuma inequality is the martingale extension of the Hoeffding in- 
equality. 

McDiarmid Inequality: The McDiarmid bounded difference inequality concerns matrix- 
valued functions of a family of independent random variables. It demonstrates that the 
extreme eigenvalues of the matrix-valued function exhibit normal concentration. 

For more refined martingale inequalities, see the papers [Oli10a, Tro11a] and the technical
report [Tro11c].

1.3. Summary of Related Work. We continue with an overview of some related work on finite- 
dimensional random matrices. The first group of papers relies on the matrix extension of the 
Laplace transform method; the second group uses noncommutative moment inequalities. 

1.3.1. The Matrix Laplace Transform Method. The most important precedent for our work is the 
influential paper of Ahlswede and Winter [AW02]. They are responsible for developing the matrix 
version of the Laplace transform method, which shows that the tail probability (1.1) is controlled 
by a matrix generalization of the mgf. They describe an iterative argument, based on the Golden- 
Thompson inequality, (2.6) below, that allows them to provide a weak bound for the mgf of a sum of 
independent random matrices in terms of mgf bounds for the individual summands. In particular, 
they apply this technique to obtain an extension of the Chernoff inequality [AW02, Thm. 19]. 

The Ahlswede-Winter method for bounding the matrix mgf is quite general. Several other 
authors have exploited their technique to obtain matrix extensions of classical probability inequalities.
Christofides and Markström establish a matrix version of the Azuma and Hoeffding inequalities
[CM08]. Gross [Gro11, Thm. 6] and Recht [Rec09, Thm. 3.2] develop two different matrix
extensions of Bernstein's inequality. We also refer the reader to Vershynin's note [Ver09], which 
offers a self-contained introduction to the Ahlswede-Winter circle of ideas. 

Results established within the Ahlswede-Winter framework are often sharp for sums of i.i.d. ran- 
dom matrices, but the inequalities are far less accurate when applied to other types of sums. 
Roughly speaking, the tail bounds have the correct shape, but the method often leads to poor esti- 
mates for the quantity that controls the scale of large deviations. For a specific example, compare 
the variance parameter in (1.3) with the (correct) variance parameter appearing in (1.4). All the 
results we have mentioned so far have this shortcoming. See §3.7 for technical details. 

Very recently, Oliveira has developed two notable variations [Oli10b, Oli10a] on the Ahlswede-Winter
method for bounding the matrix mgf. These techniques can sometimes identify the correct
matrix generalization of the scale parameter. In particular, the approach in [Oli10b] can be used to
prove Theorem 1.2. Oliveira has also developed a version of the matrix Bernstein inequality [Oli10a,
Thm. 1.2] that is similar to Theorem 1.4; his proof involves a matrix extension of the martingale
techniques from [Fre75].

The current article was inspired by the work of Ahlswede-Winter [AW02] and Oliveira [Oli10b].
Our results were obtained independently from Oliveira's paper [Oli10a].

1.3.2. Noncommutative Moment Inequalities. There is another contemporary line of research that 
uses noncommutative (nc) moment inequalities to study random matrices. In a significant arti- 
cle [Rud99], Rudelson obtains an optimal estimate for the sample complexity of approximating the 
covariance matrix of a general isotropic distribution. The argument in his paper, which is due to 
Pisier, depends on a version of the nc Khintchine inequality [LP86, LPP91, Pis03]. 

Rudelson's technique has been applied widely over the last ten years, and it has emerged as 
a valuable tool for studying discrete random matrices. For example, the method can be used to 
provide bounds on the norm of a random submatrix [RV07, Thm. 1.8] drawn from a fixed matrix. It 
seems likely, however, that matrix probability inequalities will replace the nc Khintchine inequality 
for many applications because they are easier to use and often produce better results. 

By now, there is a substantial literature on other nc moment inequalities. The article [JX05] 
contains a reasonably accessible and comprehensive discussion. Some of these results have been 
applied to the study of random matrices; see [JX08] for an example. As we discuss in §4.7, nc 
moment bounds can also be combined with the matrix Laplace transform method because they 
sometimes provide an alternative way to control the matrix mgf. 

1.4. Roadmap. The rest of the paper is organized as follows. Section 2 introduces the background
results required for our proofs. Section 3 proves the main technical results that lead to probability 
inequalities for sums of independent random matrices. Section 4 uses Gaussian series as a case study 
to illustrate the main features of matrix probability inequalities and to argue that the bounds in 
this paper are structurally optimal. We develop the matrix Chernoff and Bernstein inequalities in 
§§5-6. Finally, we establish some simple martingale results in §7. 

2. Algebra, Analysis, and Probability with Matrices 

This section provides a short introduction to the background we require for our proofs. The 
proofs contain detailed cross-references to this material, so the reader may wish to proceed directly 
to the main thread of argument in §3. 

Most of these results can be located in Bhatia's books on matrix analysis [Bha97, Bha07]. 
The works of Horn and Johnson [HJ85, HJ94] also serve as good general references. Higham's 
book [Hig08] is an excellent source for information about matrix functions. 

2.1. Conventions on Matrices. A matrix is a finite, two-dimensional array of complex numbers. 
In this paper, all matrices are square unless otherwise noted. We add the qualification rectangular 
when we need to refer to a general array, which may be square or nonsquare. Many parts of the 
discussion do not depend on the size of a matrix, so we specify dimensions only when it matters. 
In particular, we usually do not state the size of a matrix when it is determined by the context. 

Several abbreviations are ubiquitous. Instead of self-adjoint, we often write s.a. Positive semi- 
definite becomes psd, and we shorten positive definite to pd. 

We write $0$ for the zero matrix and $I$ for the identity matrix. The matrix $E_{ij}$ has a unit entry in
the $(i, j)$ position and zeros elsewhere. The symbol $Q$ is reserved for a unitary matrix. We adopt
Parlett's convention [Par87] that bold capital letters symmetric about the vertical axis ($A, \dots, Y$
and $\Delta, \dots, \Omega$) refer to s.a. matrices.

The symbols $\lambda_{\min}$ and $\lambda_{\max}$ refer to the algebraic minimum and maximum eigenvalues of an
s.a. matrix. We use curly inequalities to denote the semidefinite ordering: $A \succcurlyeq 0$ means that $A$ is
psd. The symbol $\|\cdot\|$ always refers to the $\ell_2$ vector norm or the associated operator norm, which is
called the spectral norm because it returns the maximum singular value of its argument.

2.2. Conventions on Probability. We prefer to avoid unnecessary abstraction and technical
detail, so we frame the standing assumption that all random variables are sufficiently regular
that we are justified in computing expectations, interchanging limits, and so forth. Furthermore, 
we often state that a random variable satisfies some relation and omit the qualification "almost 
surely." We reserve the symbols X,Y for random s.a. matrices. 

2.3. Matrix Functions. Consider a function $f : \mathbb{R} \to \mathbb{R}$. We define a map on diagonal matrices
by applying the function to each diagonal entry. We then extend $f$ to a function on s.a. matrices
using the eigenvalue decomposition:

$$ f(A) := Q \cdot f(\Lambda) \cdot Q^* \quad\text{where}\quad A = Q \Lambda Q^*. \tag{2.1} $$

The spectral mapping theorem states that each eigenvalue of $f(A)$ is equal to $f(\lambda)$ for some
eigenvalue $\lambda$ of $A$. This point is obvious from our definition.

Standard inequalities for real functions typically do not have parallel versions that hold for the 
semidefinite ordering. Nevertheless, there is one type of relation for real functions that always 
extends to the semidefinite setting: 

$$ f(a) \le g(a) \ \text{ for } a \in I \quad\Longrightarrow\quad f(A) \preccurlyeq g(A) \ \text{ when the eigenvalues of } A \text{ lie in } I. \tag{2.2} $$
We sometimes refer to (2.2) as the transfer rule. 
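
The definition (2.1) and the transfer rule (2.2) translate directly into a few lines of NumPy. The sketch below is only an illustration: the helper name `matrix_function` is hypothetical, the test matrix is an arbitrary seeded random symmetric matrix, and the scalar inequality used is $1 + a \le e^a$, which holds for every real $a$.

```python
import numpy as np

def matrix_function(f, A):
    """Apply a real function f to a Hermitian matrix A via A = Q diag(lam) Q^*."""
    lam, Q = np.linalg.eigh(A)
    # Scaling column j of Q by f(lam_j) forms Q f(Lambda) without building the diagonal matrix.
    return (Q * f(lam)) @ Q.conj().T

rng = np.random.default_rng(3)  # arbitrary seed
G = rng.standard_normal((4, 4))
A = (G + G.T) / 2.0

def f(a):
    return 1.0 + a

g = np.exp

# Transfer rule: f <= g on the real line, so g(A) - f(A) should be psd.
gap = matrix_function(g, A) - matrix_function(f, A)
min_gap_eig = np.linalg.eigvalsh(gap)[0]
print(min_gap_eig)  # nonnegative up to rounding error
```

The transfer rule applies here because $f(A)$ and $g(A)$ share the eigenvectors of $A$, so the comparison reduces to a comparison of scalars on the spectrum.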

2.4. The Matrix Exponential. The exponential of an s.a. matrix $A$ can be defined by applying
(2.1) with the function $f(x) = e^x$. Alternatively, we may use the power series expansion

$$ \exp(A) := I + \sum_{p=1}^{\infty} \frac{A^p}{p!}. $$

The exponential of an s.a. matrix is always pd because of the spectral mapping theorem. On account 
of the transfer rule (2.2), the matrix exponential satisfies some simple semidefinite relations that 
we collect here. For each s.a. matrix A, it holds that 

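$$ I + A \preccurlyeq e^{A}, \quad\text{and} \tag{2.3} $$

$$ \cosh(A) \preccurlyeq e^{A^2/2}. \tag{2.4} $$

We often work with the trace of the matrix exponential, $\operatorname{tr} \exp : A \mapsto \operatorname{tr} e^{A}$. The trace exponential
function is convex. It is also monotone with respect to the semidefinite order:

$$ A \preccurlyeq H \quad\Longrightarrow\quad \operatorname{tr} e^{A} \le \operatorname{tr} e^{H}. \tag{2.5} $$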

See [Pet94, Sec. 2] for short proofs of these facts. 

The matrix exponential does not convert sums into products, but the trace exponential has a 
related property that serves as a limited substitute. The Golden-Thompson inequality [Bha97,
Sec. IX.3] states that

$$ \operatorname{tr} e^{A+H} \le \operatorname{tr} (e^{A} e^{H}) \quad\text{for all s.a. } A, H. \tag{2.6} $$

The obvious generalization of the bound (2.6) to three matrices is false [Bha97, Prob. IX.8.4].
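
A quick numerical experiment illustrates both statements: the matrix exponential does not convert sums into products, yet the Golden-Thompson trace inequality still holds. This sketch uses arbitrary seeded random symmetric matrices, and the helper names are hypothetical.

```python
import numpy as np

def expm_sa(A):
    """Exponential of a real symmetric (or Hermitian) matrix via its eigendecomposition."""
    lam, Q = np.linalg.eigh(A)
    return (Q * np.exp(lam)) @ Q.conj().T

def random_symmetric(rng, d):
    G = rng.standard_normal((d, d))
    return (G + G.T) / 2.0

rng = np.random.default_rng(1)  # arbitrary seed
A = random_symmetric(rng, 4)
H = random_symmetric(rng, 4)

lhs = np.trace(expm_sa(A + H))            # tr e^{A+H}
rhs = np.trace(expm_sa(A) @ expm_sa(H))   # tr (e^A e^H)
# Frobenius distance between e^{A+H} and e^A e^H; nonzero when A and H do not commute.
gap = np.linalg.norm(expm_sa(A + H) - expm_sa(A) @ expm_sa(H))
print(lhs, rhs, gap)
```

The product $e^A e^H$ is in general not even self-adjoint, which is one way to see that it cannot coincide with $e^{A+H}$ for noncommuting arguments.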

2.5. The Matrix Logarithm. We define the matrix logarithm as the functional inverse of the 
matrix exponential: 

$$ \log(e^{A}) := A \quad\text{for each s.a. matrix } A. \tag{2.7} $$

This formula determines the logarithm on the pd cone, which is adequate for our purposes. 

The matrix logarithm interacts beautifully with the semidefinite order [Bha07, Exer. 4.2.5]. 
Indeed, the logarithm is operator monotone: 

$$ 0 \prec A \preccurlyeq H \quad\Longrightarrow\quad \log(A) \preccurlyeq \log(H). \tag{2.8} $$

The logarithm is also operator concave: 

$$ \tau \log(A) + (1 - \tau) \log(H) \preccurlyeq \log(\tau A + (1 - \tau) H) \quad\text{for all pd } A, H \text{ and } \tau \in [0, 1]. \tag{2.9} $$

Caveat lector: Operator monotone functions and operator convex functions are depressingly rare. 
In particular, the matrix exponential does not belong to either class [Bha97, Ch. V]. 
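
The contrast between the logarithm and the exponential can be seen on a $2 \times 2$ example. The pair of matrices below is a hand-picked illustration rather than an example from the paper, and the helper name is hypothetical.

```python
import numpy as np

def apply_sa(f, A):
    """Apply a scalar function to a real symmetric matrix through its eigendecomposition."""
    lam, Q = np.linalg.eigh(A)
    return (Q * f(lam)) @ Q.T

A = np.array([[2.0, 2.0], [2.0, 2.0]])
H = np.array([[4.0, 2.0], [2.0, 2.0]])   # H - A = diag(2, 0) is psd, so A precedes H
I = np.eye(2)

# Smallest eigenvalue of e^H - e^A: a negative value shows that e^A is NOT dominated by e^H.
exp_gap = np.linalg.eigvalsh(apply_sa(np.exp, H) - apply_sa(np.exp, A))[0]

# Smallest eigenvalue of log(H + I) - log(A + I); both shifted matrices are pd.
log_gap = np.linalg.eigvalsh(apply_sa(np.log, H + I) - apply_sa(np.log, A + I))[0]

print(exp_gap)  # negative
print(log_gap)  # nonnegative up to rounding error
```

The shift by the identity keeps both arguments of the logarithm positive definite while preserving the semidefinite order between them.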

2.6. Dilations. An extraordinarily fruitful idea from operator theory is to embed matrices within 
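larger block matrices, called dilations [Pau02]. The s.a. dilation of a rectangular matrix $B$ is

$$ \mathscr{S}(B) := \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix}. \tag{2.10} $$

Evidently, $\mathscr{S}(B)$ is always s.a. A short calculation yields the important identity

$$ \mathscr{S}(B)^2 = \begin{bmatrix} BB^* & 0 \\ 0 & B^*B \end{bmatrix}. \tag{2.11} $$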

It can also be verified that the s.a. dilation preserves spectral information: 

$$ \lambda_{\max}(\mathscr{S}(B)) = \|\mathscr{S}(B)\| = \|B\|. \tag{2.12} $$

We use dilations to extend results for s.a. matrices to rectangular matrices. See Remark 3.11 
and §4.2 for more information about this technique. 
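
The dilation and its basic identities can be verified numerically. In the sketch below, the function name `sa_dilation` and the seeded random test matrix are assumptions made for illustration.

```python
import numpy as np

def sa_dilation(B):
    """Self-adjoint dilation [[0, B], [B^*, 0]] of a d1 x d2 matrix B."""
    d1, d2 = B.shape
    return np.block([[np.zeros((d1, d1)), B],
                     [B.conj().T, np.zeros((d2, d2))]])

rng = np.random.default_rng(4)  # arbitrary seed
B = rng.standard_normal((3, 5))
S = sa_dilation(B)

is_self_adjoint = np.allclose(S, S.conj().T)

# Identity (2.11): the square of the dilation is block diagonal.
expected_square = np.block([[B @ B.conj().T, np.zeros((3, 5))],
                            [np.zeros((5, 3)), B.conj().T @ B]])
square_ok = np.allclose(S @ S, expected_square)

# Identity (2.12): largest eigenvalue and norm of the dilation equal the norm of B.
lam_max = np.linalg.eigvalsh(S)[-1]
norm_S = np.linalg.norm(S, ord=2)
norm_B = np.linalg.norm(B, ord=2)
print(is_self_adjoint, square_ok, lam_max, norm_S, norm_B)
```

Note that the dilation has dimension $d_1 + d_2$, which is the source of the dimensional factor in the rectangular results.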

2.7. Expectation and the Semidefinite Order. Since the expectation of a random matrix
can be viewed as a convex combination and the psd cone is convex, expectation preserves the 
semidefinite order: 

$$ X \preccurlyeq Y \ \text{ almost surely} \quad\Longrightarrow\quad \mathbb{E} X \preccurlyeq \mathbb{E} Y. \tag{2.13} $$

Every operator convex function admits an operator Jensen's inequality [HP03]. In particular, 
the matrix square is operator convex, which implies that 

$$ (\mathbb{E} X)^2 \preccurlyeq \mathbb{E}(X^2). \tag{2.14} $$

The relation (2.14) is also a specific instance of Kadison's inequality [Bha07, Thm. 2.3.2]. 
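
For a random matrix that takes finitely many values with equal probability, the relation (2.14) can be checked exactly. The sketch below uses seeded random symmetric matrices as the support of the distribution; this choice is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
samples = []
for _ in range(6):
    G = rng.standard_normal((4, 4))
    samples.append((G + G.T) / 2.0)

# X is uniform on `samples`, so expectations are plain averages.
EX = np.mean(samples, axis=0)
EX2 = np.mean([X @ X for X in samples], axis=0)

# E(X^2) - (E X)^2 equals E[(X - E X)^2], a matrix variance, and should be psd.
variance = EX2 - EX @ EX
min_eig = np.linalg.eigvalsh(variance)[0]
print(min_eig)  # nonnegative up to rounding error
```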

3. Tail Bounds via the Laplace Transform Method 

This section develops some general probability inequalities for the maximum eigenvalue of a sum 
of independent random matrices. The main argument can be viewed as a matrix extension of the 
Laplace transform method for sums of independent real random variables. In the matrix setting, 
however, it requires great care to execute this technique successfully. 

3.1. Matrix Moments and Cumulants. Consider a random s.a. matrix X that has moments 
of all orders. By analogy with the classical scalar definitions, we may construct matrix extensions 
of the moment generating function (mgf) and the cumulant generating function (cgf): 

$$ M_X(\theta) := \mathbb{E} e^{\theta X} \quad\text{and}\quad \Xi_X(\theta) := \log \mathbb{E} e^{\theta X} \quad\text{for } \theta \in \mathbb{R}. \tag{3.1} $$

We admit the possibility that these expectations do not exist for all values of $\theta$. The matrix cgf
can be viewed as an exponential mean, a weighted average that emphasizes large deviations (with
the same sign as $\theta$). The matrix mgf and cgf have formal power series expansions:

$$ M_X(\theta) = I + \sum_{p=1}^{\infty} \frac{\theta^p}{p!} \, \mathbb{E}(X^p) \quad\text{and}\quad \Xi_X(\theta) = \sum_{p=1}^{\infty} \frac{\theta^p}{p!} \, \Psi_p. $$

The coefficients $\mathbb{E}(X^p)$ are called matrix moments, and we refer to $\Psi_p$ as a matrix cumulant. The
matrix cumulant $\Psi_p$ has a formal expression as a (noncommutative) polynomial in the matrix
moments up to order $p$. In particular, the first cumulant is the mean and the second cumulant is
the variance:

$$ \Psi_1 = \mathbb{E} X \quad\text{and}\quad \Psi_2 = \mathbb{E}(X^2) - (\mathbb{E} X)^2. $$
Higher-order cumulants are harder to write down and interpret. 

3.2. The Laplace Transform Method for Matrices. We begin our main development with a 
striking idea drawn from the influential paper [AW02] of Ahlswede and Winter. Their work contains 
a matrix analog of the classical Laplace transform bound. We need the following variant, which is 
due to Oliveira [Oli10b].

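Proposition 3.1 (The Laplace Transform Method). Let $Y$ be a random self-adjoint matrix. For
all $t \in \mathbb{R}$,

$$ \mathbb{P}\{ \lambda_{\max}(Y) \ge t \} \le \inf_{\theta > 0} \big\{ e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} e^{\theta Y} \big\}. $$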

In words, we can control tail probabilities for the maximum eigenvalue of a random matrix by 
producing a bound for the trace of the matrix mgf defined in (3.1). 

Proof. Fix a positive number $\theta$. We have the chain of relations

$$ \mathbb{P}\{ \lambda_{\max}(Y) \ge t \} = \mathbb{P}\{ \lambda_{\max}(\theta Y) \ge \theta t \} = \mathbb{P}\big\{ e^{\lambda_{\max}(\theta Y)} \ge e^{\theta t} \big\} \le e^{-\theta t} \cdot \mathbb{E} e^{\lambda_{\max}(\theta Y)}. $$

The first identity uses the homogeneity of the maximum eigenvalue map, and the second relies on
the monotonicity of the scalar exponential function; the third relation is Markov's inequality. To
bound the exponential, note that

$$ e^{\lambda_{\max}(\theta Y)} = \lambda_{\max}(e^{\theta Y}) \le \operatorname{tr} e^{\theta Y}. $$

The identity is the spectral mapping theorem; the inequality holds because the exponential of an
s.a. matrix is pd and the maximum eigenvalue of a pd matrix is dominated by the trace. Combine
the latter two relations to reach

$$ \mathbb{P}\{ \lambda_{\max}(Y) \ge t \} \le e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} e^{\theta Y}. $$

This inequality holds for any positive $\theta$, so we may take an infimum to complete the proof. □
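
The Laplace transform bound can be evaluated exactly when $Y$ takes finitely many values. The following sketch is an illustration under stated assumptions: $Y$ is a Rademacher series with three hand-picked $2 \times 2$ coefficients, the infimum over $\theta$ is replaced by a minimum over a finite grid (which still yields a valid bound, since the inequality holds for every positive $\theta$), and the function names are hypothetical.

```python
import itertools
import numpy as np

def trace_mgf(outcomes, theta):
    """E tr exp(theta * Y) for Y uniform on the list `outcomes` of symmetric matrices."""
    # tr exp(theta Y) is the sum of exp(theta * lambda_i) over the eigenvalues of Y.
    return np.mean([np.sum(np.exp(theta * np.linalg.eigvalsh(Y))) for Y in outcomes])

def laplace_bound(outcomes, t, thetas):
    """Minimum over the grid `thetas` of exp(-theta t) * E tr exp(theta Y)."""
    return min(np.exp(-th * t) * trace_mgf(outcomes, th) for th in thetas)

def exact_tail(outcomes, t):
    """Exact P{lambda_max(Y) >= t} for Y uniform on `outcomes`."""
    return np.mean([np.linalg.eigvalsh(Y)[-1] >= t for Y in outcomes])

coeffs = [np.array([[1.0, 0.0], [0.0, -1.0]]),
          np.array([[0.0, 1.0], [1.0, 0.0]]),
          np.array([[1.0, 1.0], [1.0, 1.0]]) / 2.0]
outcomes = [sum(s * A for s, A in zip(signs, coeffs))
            for signs in itertools.product((-1.0, 1.0), repeat=len(coeffs))]
thetas = np.linspace(0.05, 5.0, 100)

for t in (0.5, 1.0, 2.0, 3.0):
    print(t, exact_tail(outcomes, t), laplace_bound(outcomes, t, thetas))
```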

3.3. The Failure of the Matrix mgf. In the scalar setting, the Laplace transform method is 
very effective for studying sums of independent random variables because the mgf decomposes. 
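Consider an independent sequence $\{X_k\}$ of real random variables. Operating formally, we see that
the (scalar) mgf of the sum satisfies a multiplication rule:

$$ M_{(\sum_k X_k)}(\theta) = \mathbb{E} \exp\Big( \sum\nolimits_k \theta X_k \Big) = \mathbb{E} \prod\nolimits_k e^{\theta X_k} = \prod\nolimits_k \mathbb{E} e^{\theta X_k} = \prod\nolimits_k M_{X_k}(\theta). \tag{3.2} $$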

This calculation relies on the fact that the scalar exponential function converts sums to products, 
a property the matrix exponential does not share. As a consequence, there is no immediate analog 
of (3.2) in the matrix setting. 

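Ahlswede and Winter attempt to imitate the multiplication rule (3.2) using the following observation.
When $X_1$ and $X_2$ are independent random matrices,

$$ \operatorname{tr} M_{X_1 + X_2}(\theta) \le \mathbb{E} \operatorname{tr} \big[ e^{\theta X_1} e^{\theta X_2} \big] = \operatorname{tr} \big[ (\mathbb{E} e^{\theta X_1})(\mathbb{E} e^{\theta X_2}) \big] = \operatorname{tr} \big[ M_{X_1}(\theta) \cdot M_{X_2}(\theta) \big]. \tag{3.3} $$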

The first relation is the Golden-Thompson trace inequality (2.6). Unfortunately, we cannot extend 
the bound (3.3) to include additional matrices. This cold fact suggests that the Golden-Thompson 
inequality may not be the natural way to proceed. In §3.7, we map out the route Ahlswede and 
Winter pursue, but we continue along a different path. 

3.4. A Concave Trace Function. For inspiration, we turn to the literature on matrix analysis. 
Some of the most beautiful and profound results in this domain concern the convexity of trace 
functions. We have observed that this theory has incredible implications for the study of random 
matrices. This paper demonstrates that a large class of matrix probability inequalities follows from 
a deep theorem [Lie73, Thm. 6] of Lieb that appears in his seminal work on convex trace functions. 

Theorem 3.2 (Lieb). Fix a self-adjoint matrix $H$. The function

$$ A \longmapsto \operatorname{tr} \exp(H + \log(A)) $$

is concave on the positive-definite cone.

Epstein provides an alternative proof of Theorem 3.2 in [Eps73, Sec. II], and Ruskai offers a 
simplified account of Epstein's argument in [Rus02, Rus05]. The note [Tro11b] derives Lieb's
theorem from the joint convexity of quantum relative entropy [Lin74, Lem. 2]. The latter approach 
is advantageous because the joint convexity result admits several elegant, conceptual proofs, such 
as [Eff09, Cor. 2.2]. 

We require a simple but powerful corollary of Lieb's theorem. This result describes how expec- 
tation interacts with the trace exponential. 

Corollary 3.3. Let H be a fixed self-adjoint matrix, and let X be a random self-adjoint matrix. 
Then 

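$$\mathbb{E} \operatorname{tr} \exp(H + X) \le \operatorname{tr} \exp\big(H + \log(\mathbb{E}\, e^{X})\big).$$

Proof. Define the random matrix $Y = e^{X}$, and calculate that 

$$\mathbb{E} \operatorname{tr} \exp(H + X) = \mathbb{E} \operatorname{tr} \exp(H + \log(Y)) \le \operatorname{tr} \exp(H + \log(\mathbb{E}\, Y)) = \operatorname{tr} \exp\big(H + \log(\mathbb{E}\, e^{X})\big).$$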

The first identity follows from the definition (2.7) of the matrix logarithm because Y is always 
pd. Lieb's result, Theorem 3.2, ensures that the trace function is concave in Y, so we may invoke 
Jensen's inequality to draw the expectation inside the logarithm. □ 
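The inequality of Corollary 3.3 can be checked numerically when $X$ takes finitely many values, since both expectations are then finite sums. The sketch below is only an illustration under that assumption; the helper names, the three-point distribution, and the random seed are hypothetical choices.

```python
import numpy as np

def sym_fun(A, f):
    """Apply the scalar function f to the eigenvalues of a Hermitian matrix A."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def corollary_3_3_sides(H, outcomes, probs):
    """Return (E tr exp(H + X), tr exp(H + log E e^X)) for a discrete X."""
    lhs = sum(p * np.trace(sym_fun(H + X, np.exp)) for p, X in zip(probs, outcomes))
    mean_exp = sum(p * sym_fun(X, np.exp) for p, X in zip(probs, outcomes))
    rhs = np.trace(sym_fun(H + sym_fun(mean_exp, np.log), np.exp))
    return lhs, rhs

rng = np.random.default_rng(0)

def random_symmetric(d):
    G = rng.standard_normal((d, d))
    return (G + G.T) / 2

H = random_symmetric(4)                              # fixed self-adjoint matrix
outcomes = [random_symmetric(4) for _ in range(3)]   # possible values of X
probs = [0.5, 0.3, 0.2]
lhs, rhs = corollary_3_3_sides(H, outcomes, probs)
```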

3.5. Subadditivity of the Matrix cgf. Let us return to the problem of bounding the matrix mgf 
of an independent sum. Although the multiplication rule (3.2) is a dead end in the matrix case, the 
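scalar cgf has a related property that submits to generalization. For an independent family $\{X_k\}$ 
of real random variables, the scalar cgf is additive: 

$$\Xi_{(\sum_k X_k)}(\theta) = \log \mathbb{E} \exp\Big(\sum_k \theta X_k\Big) = \sum_k \log \mathbb{E}\, e^{\theta X_k} = \sum_k \Xi_{X_k}(\theta), \tag{3.4}$$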

where the second identity follows from (3.2) when we take logarithms. 

Our key insight is that Corollary 3.3 offers a completely satisfactory way to extend the addition 
rule (3.4) for scalar cgfs to the matrix setting. We have the following result. 

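Lemma 3.4 (Subadditivity of Matrix cgfs). Consider a finite sequence $\{X_k\}$ of independent, 
random, self-adjoint matrices. Then 

$$\mathbb{E} \operatorname{tr} \exp\Big(\sum_k \theta X_k\Big) \le \operatorname{tr} \exp\Big(\sum_k \log \mathbb{E}\, e^{\theta X_k}\Big) \quad \text{for } \theta \in \mathbb{R}.$$

Proof. It does no harm to assume $\theta = 1$. Let $\mathbb{E}_k$ denote the expectation, conditioned on $X_1, \dots, X_k$. 
Abbreviate 

$$\Xi_k := \log\big(\mathbb{E}_{k-1}\, e^{X_k}\big) = \log\big(\mathbb{E}\, e^{X_k}\big),$$
where the equality holds because the family $\{X_k\}$ is independent. We see that 

$$\begin{aligned}
\mathbb{E} \operatorname{tr} \exp\Big(\sum_{k=1}^{n} X_k\Big) &= \mathbb{E}_0 \cdots \mathbb{E}_{n-1} \operatorname{tr} \exp\Big(\sum_{k=1}^{n-1} X_k + X_n\Big) \\
&\le \mathbb{E}_0 \cdots \mathbb{E}_{n-2} \operatorname{tr} \exp\Big(\sum_{k=1}^{n-1} X_k + \log\big(\mathbb{E}_{n-1}\, e^{X_n}\big)\Big) \\
&= \mathbb{E}_0 \cdots \mathbb{E}_{n-2} \operatorname{tr} \exp\Big(\sum_{k=1}^{n-2} X_k + X_{n-1} + \Xi_n\Big) \\
&\le \mathbb{E}_0 \cdots \mathbb{E}_{n-3} \operatorname{tr} \exp\Big(\sum_{k=1}^{n-2} X_k + \Xi_{n-1} + \Xi_n\Big) \\
&\cdots \le \operatorname{tr} \exp\Big(\sum_{k=1}^{n} \Xi_k\Big).
\end{aligned}$$

The first line relies on the tower property of conditional expectation. At each step $m = 1, 2, \dots, n$, 
we invoke Corollary 3.3 with the fixed matrix $H$ equal to 

$$H_m = \sum_{k=1}^{m-1} X_k + \sum_{k=m+1}^{n} \Xi_k.$$

This act is legal because $H_m$ does not depend on $X_m$. □ 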

Remark 3.5. To make the parallel with the addition rule (3.4) clearer, we can rewrite the conclusion 
of Lemma 3.4 in the form 

$$\operatorname{tr} \exp\big(\Xi_{(\sum_k X_k)}(\theta)\big) \le \operatorname{tr} \exp\Big(\sum_k \Xi_{X_k}(\theta)\Big)$$
by applying the definition (3.1) of the matrix cgf. 
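The subadditivity rule of Lemma 3.4 can be tested exactly on a small Rademacher series $X_k = \varepsilon_k A_k$, because the expectation on the left is a finite average over sign patterns and $\mathbb{E}\, e^{\varepsilon A} = \cosh(A)$. The following sketch takes $\theta = 1$; the coefficient matrices, the seed, and the helper `sym_fun` are illustrative assumptions.

```python
import itertools
import numpy as np

def sym_fun(A, f):
    """Apply the scalar function f to the eigenvalues of a Hermitian matrix A."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

rng = np.random.default_rng(1)
d, n = 3, 4
As = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    As.append((G + G.T) / 2)          # fixed self-adjoint coefficients

# Left-hand side: exact expectation over all 2^n sign patterns.
lhs = np.mean([
    np.trace(sym_fun(sum(s * A for s, A in zip(signs, As)), np.exp))
    for signs in itertools.product([-1.0, 1.0], repeat=n)
])

# Right-hand side: the cgf of eps * A is log cosh(A), applied spectrally.
cgf_sum = sum(sym_fun(A, lambda w: np.log(np.cosh(w))) for A in As)
rhs = np.trace(sym_fun(cgf_sum, np.exp))
```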

3.6. Tail Bounds for Independent Sums. This section contains abstract tail bounds for the sum 
of independent random matrices. Later, we will specialize these results to some specific situations. 
We begin with a very general inequality, which is the progenitor of our other results. 

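Theorem 3.6 (Master Tail Bound for Independent Sums). Consider a finite sequence $\{X_k\}$ of 
independent, random, self-adjoint matrices. For all $t \in \mathbb{R}$, 

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k X_k\Big) \ge t\Big\} \le \inf_{\theta > 0} \Big\{ e^{-\theta t} \cdot \operatorname{tr} \exp\Big(\sum_k \log \mathbb{E}\, e^{\theta X_k}\Big) \Big\}. \tag{3.5}$$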

Proof. Substitute the subadditivity rule for matrix cgfs, Lemma 3.4, into the Laplace transform 
bound, Proposition 3.1. □ 

Our first corollary adapts Theorem 3.6 to the case that arises most often in practice. We call 
upon this result several times to obtain tail bounds under a variety of assumptions about the 
structure of the random matrices. 

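Corollary 3.7. Consider a finite sequence $\{X_k\}$ of independent, random, self-adjoint matrices 
with dimension $d$. Assume there is a function $g : (0, \infty) \to [0, \infty]$ and a sequence $\{A_k\}$ of fixed 
self-adjoint matrices that satisfy the relations 

$$\mathbb{E}\, e^{\theta X_k} \preceq e^{g(\theta) \cdot A_k} \quad \text{for } \theta > 0. \tag{3.6}$$

Define the scale parameter 

$$\rho := \lambda_{\max}\Big(\sum_k A_k\Big).$$

Then, for all $t \in \mathbb{R}$, 

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k X_k\Big) \ge t\Big\} \le d \cdot \inf_{\theta > 0} e^{-\theta t + g(\theta) \cdot \rho}. \tag{3.7}$$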

Proof. The hypothesis (3.6) implies that 

$$\log \mathbb{E}\, e^{\theta X_k} \preceq g(\theta) \cdot A_k \quad \text{for } \theta > 0 \tag{3.8}$$

because of the property (2.8) that the matrix logarithm is operator monotone. Recall the fact (2.5) 
that the trace exponential is monotone with respect to the semidefinite order. As a consequence, 
we can introduce each relation from the family (3.8) into the master inequality (3.5). For each 
$\theta > 0$, it follows that 

$$\begin{aligned}
\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k X_k\Big) \ge t\Big\} &\le e^{-\theta t} \cdot \operatorname{tr} \exp\Big(g(\theta) \cdot \sum_k A_k\Big) \\
&\le e^{-\theta t} \cdot d \cdot \lambda_{\max}\Big(\exp\Big(g(\theta) \cdot \sum_k A_k\Big)\Big) \\
&= d \cdot e^{-\theta t} \cdot \exp\Big(g(\theta) \cdot \lambda_{\max}\Big(\sum_k A_k\Big)\Big).
\end{aligned}$$

The second inequality holds because the trace of a pd matrix, such as the exponential, is bounded 
by the dimension $d$ times the maximum eigenvalue. The last line depends on the spectral mapping 
theorem and the fact that the function $g$ is nonnegative. Identify the quantity $\rho$, and take the 
infimum over positive $\theta$ to reach the conclusion (3.7). □ 

Remark 3.8. An alternative expression of the result (3.7) is that 

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k X_k\Big) \ge t\Big\} \le d \cdot \exp\Big(-\sup_{\theta > 0} \{\theta t - g(\theta) \cdot \rho\}\Big) = d \cdot \exp\big(-\rho \cdot g^*(t/\rho)\big).$$

In words, the exponent in the tail bound can be written in terms of the perspective transformation 
of the Fenchel-Legendre conjugate of the function $g$. This inequality parallels the upper estimate 
in Cramér's classical result for large deviations [DZ98, Thm. 2.2.3]. 
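To make the optimization in (3.7) concrete, the sketch below evaluates the bound on a grid of $\theta$ values for the quadratic choice $g(\theta) = \theta^2/2$ and compares it with the closed form $d \cdot e^{-t^2/2\rho}$ obtained at $\theta = t/\rho$. The function name, the grid, and the numerical values of $t$, $\rho$, $d$ are illustrative assumptions.

```python
import math

def corollary_3_7_bound(t, rho, d, g, thetas):
    """Evaluate d * min over the grid of exp(-theta * t + g(theta) * rho)."""
    return d * min(math.exp(-theta * t + g(theta) * rho) for theta in thetas)

def g(theta):
    # Quadratic choice, as for Gaussian and Rademacher series.
    return theta ** 2 / 2

t, rho, d = 3.0, 2.0, 10
grid = [k / 1000 for k in range(1, 5001)]          # theta in (0, 5]
numeric = corollary_3_7_bound(t, rho, d, g, grid)
closed_form = d * math.exp(-t ** 2 / (2 * rho))    # value at theta = t / rho
```

A grid search only approximates the infimum in general; here the minimizer $\theta = 1.5$ happens to lie on the grid.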

It is also worthwhile to state another consequence of Theorem 3.6. This bound is sometimes more 
useful than Corollary 3.7 because it combines the mgfs of the random matrices together under a 
single logarithm. 

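Corollary 3.9. Consider a sequence $\{X_k : k = 1, 2, \dots, n\}$ of independent, random, self-adjoint 
matrices with dimension $d$. For all $t \in \mathbb{R}$, 

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_{k=1}^{n} X_k\Big) \ge t\Big\} \le d \cdot \inf_{\theta > 0} \exp\Big(-\theta t + n \cdot \log \lambda_{\max}\Big(\frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\, e^{\theta X_k}\Big)\Big). \tag{3.9}$$

Proof. Recall the fact (2.9) that the matrix logarithm is operator concave. For each $\theta > 0$, it follows 
that 

$$\sum_{k=1}^{n} \log \mathbb{E}\, e^{\theta X_k} = n \cdot \frac{1}{n} \sum_{k=1}^{n} \log \mathbb{E}\, e^{\theta X_k} \preceq n \cdot \log\Big(\frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\, e^{\theta X_k}\Big).$$

The property (2.5) that the trace exponential is monotone allows us to introduce the latter relation 
into the master inequality (3.5) to obtain 

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_{k=1}^{n} X_k\Big) \ge t\Big\} \le e^{-\theta t} \cdot \operatorname{tr} \exp\Big(n \cdot \log\Big(\frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\, e^{\theta X_k}\Big)\Big).$$

To complete the proof, we bound the trace by $d$ times the maximum eigenvalue, and we invoke 
the spectral mapping theorem (twice!) to draw the maximum eigenvalue map inside the logarithm. 
Take the infimum over positive $\theta$ to reach (3.9). □ 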

We conclude this section with remarks on some other situations that we can analyze using the 
master tail bound, Theorem 3.6, and its corollaries. 

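Remark 3.10 (Minimum Eigenvalue). We can study the minimum eigenvalue of a sum of random 
s.a. matrices because $\lambda_{\min}(A) = -\lambda_{\max}(-A)$. As a result, 

$$\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_k X_k\Big) \le t\Big\} = \mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k -X_k\Big) \ge -t\Big\}.$$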

In §5, we apply this observation to develop lower Chernoff bounds. 

Remark 3.11 (Maximum Singular Value). We can also analyze the maximum singular value of a 
sum of random rectangular matrices by applying these results to the s.a. dilation (2.10). For a 
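finite sequence $\{Z_k\}$ of independent, random, rectangular matrices, we have 

$$\mathbb{P}\Big\{\Big\|\sum_k Z_k\Big\| \ge t\Big\} = \mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k \mathscr{H}(Z_k)\Big) \ge t\Big\}$$
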
on account of (2.12) and the property that the dilation is real-linear. This device allows us to 
extend most of the tail bounds in this paper to rectangular matrices. See §4 for an application to 
Gaussian and Rademacher series. 
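The dilation device is easy to exercise numerically. The sketch below assumes the usual block form of the s.a. dilation, with $Z$ in the upper-right block and $Z^*$ in the lower-left block, and confirms that its maximum eigenvalue equals the spectral norm of $Z$ and that the map is additive. The function name `dilation` and the test matrices are illustrative.

```python
import numpy as np

def dilation(Z):
    """Self-adjoint dilation [[0, Z], [Z^*, 0]] of a d1 x d2 matrix Z."""
    d1, d2 = Z.shape
    return np.block([[np.zeros((d1, d1)), Z],
                     [Z.conj().T, np.zeros((d2, d2))]])

rng = np.random.default_rng(2)
Z1 = rng.standard_normal((3, 5))
Z2 = rng.standard_normal((3, 5))

H = dilation(Z1)
lam_max = np.linalg.eigvalsh(H)[-1]     # eigenvalues are returned in ascending order
spec_norm = np.linalg.norm(Z1, 2)       # largest singular value of Z1
```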

Remark 3.12 (Martingales). It is possible to combine the proofs of Lemma 3.4 and Theorem 3.6 
to obtain some simple results for matrix martingales. See the demonstration of the matrix Azuma 
inequality in §7 for an example of this approach. To reach fully detailed results for martingales, 
one must use a fundamentally different style of argument [Oli10a, Tro11a]. 

3.7. The Ahlswede-Winter Method. Ahlswede and Winter use a different approach to bound 
the matrix mgf, which exploits the multiplicative bound (3.3) for the trace exponential of a sum of 
two independent, random, s.a. matrices. The reader may find their argument interesting. 

Consider a sequence $\{X_k : k = 1, 2, \dots, n\}$ of independent, random, s.a. matrices with dimension 
$d$, and let $Y = \sum_k X_k$. The trace inequality (3.3) implies that 

$$\operatorname{tr} M_Y(\theta) \le \operatorname{tr}\Big[\Big(\mathbb{E}\, e^{\sum_{k=1}^{n-1} \theta X_k}\Big)\Big(\mathbb{E}\, e^{\theta X_n}\Big)\Big] \le \operatorname{tr}\Big(\mathbb{E}\, e^{\sum_{k=1}^{n-1} \theta X_k}\Big) \cdot \lambda_{\max}\big(\mathbb{E}\, e^{\theta X_n}\big).$$

Iterating this procedure leads to the relation 

$$\operatorname{tr} M_Y(\theta) \le (\operatorname{tr} I) \cdot \Big[\prod_k \lambda_{\max}\big(\mathbb{E}\, e^{\theta X_k}\big)\Big] = d \cdot \exp\Big(\sum_k \lambda_{\max}\big(\log \mathbb{E}\, e^{\theta X_k}\big)\Big). \tag{3.10}$$

The bound (3.10) is the key to the Ahlswede-Winter method for producing probability inequalities. 
As a consequence, their approach generally leads to tail bounds that depend on a scale parameter 
involving "the sum of eigenvalues." See, for example, the bound (1.3) or the matrix probability 
inequalities presented in the papers [AW02, CM08, Gro11, Rec09]. 

In contrast, our result on the subadditivity of cumulants, Lemma 3.4, implies that 

$$\operatorname{tr} M_Y(\theta) \le d \cdot \exp\Big(\lambda_{\max}\Big(\sum_k \log \mathbb{E}\, e^{\theta X_k}\Big)\Big). \tag{3.11}$$

Probability inequalities developed with (3.11) contain a scale parameter that involves the "eigen- 
value of a sum." See, for example, the bound (1.4). The exponent in (3.10) often exceeds the 
exponent in (3.11) by a factor of d, the ambient dimension, which is a serious loss. Section 4.8 
describes concrete situations where this discrepancy occurs. 



4. Case Study: Matrix Gaussian Series 



A matrix Gaussian series stands among the simplest instances of a sum of independent random 
matrices. Nevertheless, this example already exhibits several new phenomena that arise when we 
translate scalar tail bounds to the matrix setting. Consequently, we explore this fundamental case 
in depth as a way to develop insights about other matrix probability inequalities. 

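4.1. Main Results. We begin with the scalar case. Consider a finite sequence $\{a_k\}$ of real numbers 
and a finite sequence $\{\gamma_k\}$ of independent standard Gaussian variables. We have the probability 
inequality 

$$\mathbb{P}\Big\{\sum_k \gamma_k a_k \ge t\Big\} \le e^{-t^2/2\sigma^2} \quad \text{where } \sigma^2 := \sum_k a_k^2. \tag{4.1}$$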



This result testifies that a Gaussian series with real coefficients satisfies a normal-type tail bound 
where the variance is controlled by the sum of the squared coefficients. The relation (4.1) follows 
easily from the scalar Laplace transform method. An alternative proof proceeds using the rotational 
invariance of a standard normal vector along with basic estimates on the error function. 

The inequality (4.1) generalizes directly to the noncommutative setting, as do many other scalar 
tail bounds. The matrix Laplace transform method, Proposition 3.1, delivers the following result 
on the tail behavior of a matrix Gaussian series. 

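Theorem 4.1 (Matrix Gaussian and Rademacher Series). Consider a finite sequence $\{A_k\}$ of fixed 
self-adjoint matrices with dimension $d$, and let $\{\gamma_k\}$ be a finite sequence of independent standard 
normal variables. Compute the variance parameter 

$$\sigma^2 := \Big\|\sum_k A_k^2\Big\|. \tag{4.2}$$

Then, for all $t \ge 0$, 

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k \gamma_k A_k\Big) \ge t\Big\} \le d \cdot e^{-t^2/2\sigma^2}. \tag{4.3}$$

In particular, 

$$\mathbb{P}\Big\{\Big\|\sum_k \gamma_k A_k\Big\| \ge t\Big\} \le 2d \cdot e^{-t^2/2\sigma^2}. \tag{4.4}$$

The same bounds hold when we replace $\{\gamma_k\}$ by a finite sequence of independent Rademacher random 
variables. 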
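The quantities in Theorem 4.1 are straightforward to compute. The following sketch evaluates the variance parameter (4.2) and the right-hand side of the norm bound (4.4), and compares the bound with a Monte Carlo estimate of the tail for the diagonal series with coefficients $E_{kk}$. The function names, the sample size, the seed, and the level $t = 2.5$ are illustrative assumptions; a simulation of this kind illustrates the bound but does not prove it.

```python
import numpy as np

def variance_parameter(As):
    """sigma^2 = || sum_k A_k^2 || in the spectral norm, as in (4.2)."""
    return np.linalg.norm(sum(A @ A for A in As), 2)

def norm_tail_bound(As, t):
    """Right-hand side of (4.4): 2d * exp(-t^2 / (2 sigma^2))."""
    d = As[0].shape[0]
    return 2 * d * np.exp(-t ** 2 / (2 * variance_parameter(As)))

def empirical_norm_tail(As, t, trials, rng):
    """Monte Carlo estimate of P{ || sum_k gamma_k A_k || >= t }."""
    gammas = rng.standard_normal((trials, len(As)))
    Y = np.einsum("tk,kij->tij", gammas, np.array(As))   # one series per trial
    norms = np.linalg.norm(Y, ord=2, axis=(1, 2))        # spectral norm of each
    return np.mean(norms >= t)

d = 4
As = [np.diag(np.eye(d)[k]) for k in range(d)]   # A_k = E_kk
rng = np.random.default_rng(3)
t = 2.5
empirical = empirical_norm_tail(As, t, 5000, rng)
bound = norm_tail_bound(As, t)
```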

Observe that the bound (4.3) reduces to the scalar result (4.1) when the dimension $d = 1$. 
Of course, one may wonder whether the generalization (4.2) of the scalar variance is sharp and 
whether the dimensional dependence in (4.3) is necessary. A primary objective of this section is to 
demonstrate that Theorem 4.1 cannot be improved without changing its form. 

Most of the inequalities in this paper have variants that concern the maximum singular value 
of a sum of rectangular random matrices. These extensions follow immediately when we apply 
the s.a. results to the s.a. dilation of the sum of rectangular matrices. Here is the general version 
of Theorem 4.1, which serves as a model for other rectangular results. 

Corollary 4.2 (Rectangular Matrix Gaussian and Rademacher Series). Consider a finite sequence 
$\{B_k\}$ of fixed matrices with dimension $d_1 \times d_2$, and let $\{\gamma_k\}$ be a finite sequence of independent 
standard normal variables. Compute the variance parameter 

$$\sigma^2 := \max\Big\{\Big\|\sum_k B_k B_k^*\Big\|, \ \Big\|\sum_k B_k^* B_k\Big\|\Big\}.$$

Then, for all $t \ge 0$, 

$$\mathbb{P}\Big\{\Big\|\sum_k \gamma_k B_k\Big\| \ge t\Big\} \le (d_1 + d_2) \cdot e^{-t^2/2\sigma^2}.$$

The same bound holds when we replace $\{\gamma_k\}$ by a finite sequence of independent Rademacher random 
variables. 

The proofs of Theorem 4.1 and Corollary 4.2 appear below in §4.2. Unlike our other results, 
these two bounds are not new. One established argument, which we discuss in §4.7, involves 
noncommutative Khintchine inequalities. It is also possible to prove these results using Oliveira's 
ideas [Oli10b]. 

4.2. Proofs. We continue with a short demonstration of the main results for matrix Gaussian and 
Rademacher series. The first step is to obtain a semidefinite bound for the mgf of a fixed matrix 
modulated by a Gaussian variable or a Rademacher variable. This mgf bound essentially appears 
in Oliveira's work [Oli10b, Lem. 2]. 

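Lemma 4.3 (Rademacher and Gaussian mgfs). Suppose that $A$ is an s.a. matrix. Let $\varepsilon$ be a 
Rademacher random variable, and let $\gamma$ be a standard normal random variable. Then 

$$\mathbb{E}\, e^{\varepsilon \theta A} \preceq e^{\theta^2 A^2/2} \quad \text{and} \quad \mathbb{E}\, e^{\gamma \theta A} = e^{\theta^2 A^2/2} \quad \text{for } \theta \in \mathbb{R}.$$

Proof. Absorbing $\theta$ into $A$, we may assume $\theta = 1$ in each case. We begin with the Rademacher 
mgf. By direct calculation, 

$$\mathbb{E}\, e^{\varepsilon A} = \cosh(A) \preceq e^{A^2/2},$$

where the second relation is (2.4). 

For the Gaussian case, recall that the moments of a standard normal variable satisfy 

$$\mathbb{E}(\gamma^{2p+1}) = 0 \quad \text{and} \quad \mathbb{E}(\gamma^{2p}) = \frac{(2p)!}{p!\, 2^p} \quad \text{for } p = 0, 1, 2, \dots.$$

Therefore, 

$$\mathbb{E}\, e^{\gamma A} = I + \sum_{p=1}^{\infty} \frac{\mathbb{E}(\gamma^{2p})\, A^{2p}}{(2p)!} = I + \sum_{p=1}^{\infty} \frac{(A^2/2)^p}{p!} = e^{A^2/2}.$$

The first identity holds because the odd terms in the series vanish. □ 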
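Both ingredients of this proof can be checked numerically. The sketch below lists the even Gaussian moments $(2p)!/(p!\,2^p)$ and verifies the semidefinite relation $\cosh(A) \preceq e^{A^2/2}$ for one random symmetric matrix by computing the eigenvalues of the difference. The helper `sym_fun`, the seed, and the matrix size are illustrative assumptions.

```python
import math
import numpy as np

def sym_fun(A, f):
    """Apply the scalar function f to the eigenvalues of a Hermitian matrix A."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

# Even moments of a standard normal variable: E(gamma^{2p}) = (2p)! / (p! 2^p).
moments = [math.factorial(2 * p) // (math.factorial(p) * 2 ** p) for p in range(5)]

rng = np.random.default_rng(4)
G = rng.standard_normal((5, 5))
A = (G + G.T) / 2

rademacher_mgf = sym_fun(A, np.cosh)                      # E exp(eps * A)
gaussian_mgf = sym_fun(A, lambda w: np.exp(w ** 2 / 2))   # exp(A^2 / 2)
gap = np.linalg.eigvalsh(gaussian_mgf - rademacher_mgf)   # should be nonnegative
```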



The tail bounds for s.a. matrix Gaussian and Rademacher series follow easily. 
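
Proof of Theorem 4.1. Let $\{\xi_k\}$ be a finite sequence of independent standard normal variables or 
independent Rademacher variables. Invoke Lemma 4.3 to obtain 

$$\mathbb{E}\, e^{\xi_k \theta A_k} \preceq e^{g(\theta) \cdot A_k^2} \quad \text{where } g(\theta) := \theta^2/2 \text{ for } \theta > 0.$$

Recall that 

$$\sigma^2 = \Big\|\sum_k A_k^2\Big\| = \lambda_{\max}\Big(\sum_k A_k^2\Big).$$

Corollary 3.7 delivers 

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k \xi_k A_k\Big) \ge t\Big\} \le d \cdot \inf_{\theta > 0} e^{-\theta t + g(\theta) \cdot \sigma^2} = d \cdot e^{-t^2/2\sigma^2}. \tag{4.5}$$

For the record, the infimum is attained when $\theta = t/\sigma^2$. 

To obtain the norm bound (4.4), recall that $\|Y\| = \max\{\lambda_{\max}(Y), -\lambda_{\min}(Y)\}$. Standard Gaussian 
variables and Rademacher variables are symmetric, so the inequality (4.5) implies 

$$\mathbb{P}\Big\{-\lambda_{\min}\Big(\sum_k \xi_k A_k\Big) \ge t\Big\} = \mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k (-\xi_k) A_k\Big) \ge t\Big\} \le d \cdot e^{-t^2/2\sigma^2}.$$

Apply the union bound to the estimates for $\lambda_{\max}$ and $-\lambda_{\min}$ to complete the proof. □ 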

The result for a series with rectangular matrix coefficients follows immediately when we apply 
Theorem 4.1 to the s.a. dilation of the series. 

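Proof of Corollary 4.2. Let $\{\xi_k\}$ be a finite sequence of independent standard normal random 
variables or independent Rademacher random variables. Consider the sequence $\{\xi_k \mathscr{H}(B_k)\}$ of random 
s.a. matrices with dimension $d_1 + d_2$. The spectral identity (2.12) ensures that 

$$\Big\|\sum_k \xi_k B_k\Big\| = \lambda_{\max}\Big(\mathscr{H}\Big(\sum_k \xi_k B_k\Big)\Big) = \lambda_{\max}\Big(\sum_k \xi_k \mathscr{H}(B_k)\Big).$$

Thus, we may invoke Theorem 4.1 to obtain a probability inequality for the norm of the series. 
Simply observe that the matrix variance parameter (4.2) satisfies the relation 

$$\sigma^2 = \Big\|\sum_k \mathscr{H}(B_k)^2\Big\| = \left\| \begin{bmatrix} \sum_k B_k B_k^* & 0 \\ 0 & \sum_k B_k^* B_k \end{bmatrix} \right\| = \max\Big\{\Big\|\sum_k B_k B_k^*\Big\|, \ \Big\|\sum_k B_k^* B_k\Big\|\Big\}$$

on account of the identity (2.11) for the square of the s.a. dilation. □ 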



4.3. Application: A Gaussian Matrix with Nonuniform Variances. It may not be imme- 
diately clear why abstract probability inequalities, such as Theorem 4.1 and Corollary 4.2, deliver 
information about interesting random matrices that arise in practice. Let us describe a simple 
application that speaks to this concern. 

Fix a $d_1 \times d_2$ matrix $B$, and draw a random $d_1 \times d_2$ matrix $\Gamma$ whose entries are independent 
standard normal variables. Let $\odot$ denote the componentwise (i.e., Schur or Hadamard) product of 
matrices. Construct the random matrix $\Gamma \odot B$, and observe that its $(j, k)$ component is a Gaussian 
variable with mean zero and variance $|b_{jk}|^2$. We claim that 

$$\mathbb{P}\{\|\Gamma \odot B\| \ge t\} \le (d_1 + d_2) \cdot e^{-t^2/2\sigma^2} \quad \text{where } \sigma^2 = \max\Big\{\max_j \|b_{j:}\|^2, \ \max_k \|b_{:k}\|^2\Big\}. \tag{4.6}$$

The symbols $b_{j:}$ and $b_{:k}$ represent the $j$th row and $k$th column of the matrix $B$. An immediate 
consequence of (4.6) is that the median of the norm satisfies 

$$M(\|\Gamma \odot B\|) \le \sigma \sqrt{2 \log(2(d_1 + d_2))}. \tag{4.7}$$

There are nonuniform Gaussian matrices where the estimate (4.7) for the median has the correct 
order and other examples where the logarithmic factor is parasitic; see §§4.4-4.5 below. The reader 
may also wish to juxtapose (4.7) with the work of Seginer [Seg00, Thm. 3.1] and Latała [Lat05, 
Thm. 1] although these results are not fully comparable. 
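
To establish (4.6), we first decompose the matrix of interest as a Gaussian series: 

$$\Gamma \odot B = \sum_{jk} \gamma_{jk} \cdot b_{jk} \cdot E_{jk}.$$

Next, we must determine the variance parameter. Note that 

$$\sum_{jk} (b_{jk} E_{jk})(b_{jk} E_{jk})^* = \sum_j \Big(\sum_k |b_{jk}|^2\Big) E_{jj} = \operatorname{diag}\big(\|b_{1:}\|^2, \|b_{2:}\|^2, \dots, \|b_{d_1:}\|^2\big).$$

Similarly, 

$$\sum_{jk} (b_{jk} E_{jk})^*(b_{jk} E_{jk}) = \sum_k \Big(\sum_j |b_{jk}|^2\Big) E_{kk} = \operatorname{diag}\big(\|b_{:1}\|^2, \|b_{:2}\|^2, \dots, \|b_{:d_2}\|^2\big).$$

Therefore, 

$$\begin{aligned}
\sigma^2 &= \max\Big\{\big\|\operatorname{diag}\big(\|b_{1:}\|^2, \dots, \|b_{d_1:}\|^2\big)\big\|, \ \big\|\operatorname{diag}\big(\|b_{:1}\|^2, \dots, \|b_{:d_2}\|^2\big)\big\|\Big\} \\
&= \max\Big\{\max_j \|b_{j:}\|^2, \ \max_k \|b_{:k}\|^2\Big\}.
\end{aligned}$$

An application of Corollary 4.2 yields the tail bound (4.6). 
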
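The variance computation for the nonuniform Gaussian matrix can be reproduced directly. The sketch below computes $\sigma^2$ in two ways for a small real matrix $B$: from the row and column norms as in (4.6), and from the definition in Corollary 4.2 applied to the coefficient matrices $b_{jk} E_{jk}$. The function names and the example matrix are illustrative assumptions, and the code is written for real $B$ only.

```python
import numpy as np

def variance_from_rows_and_columns(B):
    """max{ max_j ||b_j:||^2, max_k ||b_:k||^2 }, as in (4.6)."""
    row_norms_sq = np.sum(np.abs(B) ** 2, axis=1)
    col_norms_sq = np.sum(np.abs(B) ** 2, axis=0)
    return max(row_norms_sq.max(), col_norms_sq.max())

def variance_from_series(B):
    """Variance parameter of Corollary 4.2 for the coefficients b_jk * E_jk (real B)."""
    d1, d2 = B.shape
    left = np.zeros((d1, d1))     # accumulates sum of C C^*
    right = np.zeros((d2, d2))    # accumulates sum of C^* C
    for j in range(d1):
        for k in range(d2):
            C = np.zeros((d1, d2))
            C[j, k] = B[j, k]
            left += C @ C.T
            right += C.T @ C
    return max(np.linalg.norm(left, 2), np.linalg.norm(right, 2))

B = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 4.0]])
sigma_sq_direct = variance_from_rows_and_columns(B)
sigma_sq_series = variance_from_series(B)
```

For this $B$, the largest squared row norm is $3^2 + 4^2 = 25$, which exceeds every squared column norm.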



4.4. Controlling the Expectation. A remarkable feature of Theorem 4.1 is that it always allows 
us to obtain reasonably accurate estimates for the expected norm of the s.a. Gaussian series 

$$Y := \sum_k \gamma_k A_k. \tag{4.8}$$

To establish this point, we first compute upper and lower bounds for the second moment of $\|Y\|$. 
Theorem 4.1 yields 

$$\mathbb{E}(\|Y\|^2) = \int_0^{\infty} \mathbb{P}\big\{\|Y\| > \sqrt{t}\big\}\, dt \le 2\sigma^2 \log(2d) + 2d \int_{2\sigma^2 \log(2d)}^{\infty} e^{-t/2\sigma^2}\, dt = 2\sigma^2 \log(2ed).$$

Jensen's inequality furnishes the lower estimate: 

$$\mathbb{E}(\|Y\|^2) = \mathbb{E}\, \|Y^2\| \ge \|\mathbb{E}(Y^2)\| = \Big\|\sum_k A_k^2\Big\| = \sigma^2.$$

The (homogeneous) first and second moment of the norm of a Gaussian series are equivalent up to 
a universal constant [LT91, Cor. 3.2], so we conclude that 

$$c\sigma \le \mathbb{E}\, \|Y\| \le \sigma \sqrt{2 \log(2ed)}. \tag{4.9}$$

This argument demonstrates that the matrix variance parameter $\sigma^2$ controls the expected norm 
$\mathbb{E}\, \|Y\|$ up to a factor that depends very weakly on the dimension. A similar remark applies to the 
median value $M(\|Y\|)$. 

4.5. The Dimensional Factor. In the inequality (4.9), the gap between the upper and lower 
bounds for $\mathbb{E}\, \|Y\|$ arises because of the dimensional factor $d$ in the statement (4.4). This dimensional 
dependence is a new feature of probability inequalities in the matrix setting. The extra term appears 
in each of our main results, and it is usually possible to identify a simple case where it is necessary. 

In particular, we cannot remove the factor $d$ from the probability bound in Theorem 4.1. Observe 
that the norm of a diagonal Gaussian matrix is typically bounded below: 

$$\Big\|\sum_{k=1}^{d} \gamma_k E_{kk}\Big\| = \max_k |\gamma_k| \ge \sqrt{2 \log d} \quad \text{with high probability.}$$

Theorem 4.1 delivers the following tail bound for this series, whose variance parameter is $\sigma^2 = 1$: 

$$\mathbb{P}\Big\{\Big\|\sum_{k=1}^{d} \gamma_k E_{kk}\Big\| \ge t\Big\} \le 2d \cdot e^{-t^2/2}.$$

The factor $2d$ ensures that this probability inequality does not become effective until $t \ge \sqrt{2 \log(2d)}$, 
comme il faut. 

We can also identify situations where the dimensional term produces an overestimate of the 
expected norm. For instance, consider a $d$-dimensional matrix drawn from the unnormalized Gaussian 
orthogonal ensemble (GOE): 

$$W = \sum_{1 \le j \le k \le d} \gamma_{jk} (E_{jk} + E_{kj}).$$

The literature contains a sharp bound for the expected norm of this matrix: 

$$\mathbb{E}\, \|W\| \le 2\sqrt{d}. \tag{4.10}$$

The result (4.10) follows from ideas of Gordon [Gor85, Gor92] elaborated in [DS02, Thm. 2.11]. 
Meanwhile, integrating the tail bound (4.4) from Theorem 4.1 yields the weaker result 

$$\mathbb{E}\, \|W\| \le \sqrt{(d + 3) \log(2ed)}. \tag{4.11}$$

The estimate (4.11) is too large by a factor of about $\sqrt{\log d}$, which is the worst possible discrepancy 
in view of (4.9). 
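The term $d + 3$ in (4.11) is the variance parameter of the GOE series. As a quick check, the sketch below sums the squares of the coefficient matrices $E_{jk} + E_{kj}$ over $j \le k$ and confirms that the result is $(d + 3) \cdot I$. The function name is an illustrative assumption, and indices are zero-based in the code.

```python
import numpy as np

def goe_variance_matrix(d):
    """Sum of (E_jk + E_kj)^2 over j <= k for the unnormalized GOE series."""
    total = np.zeros((d, d))
    for j in range(d):
        for k in range(j, d):
            E = np.zeros((d, d))
            E[j, k] = 1.0
            C = E + E.T          # coefficient of gamma_jk; equals 2 * E_jj when j == k
            total += C @ C
    return total

d = 5
V = goe_variance_matrix(d)
sigma_sq = np.linalg.norm(V, 2)   # variance parameter (4.2) of the series
```

Each diagonal position collects $4$ from the term with $j = k$ and $1$ from each of the $d - 1$ off-diagonal pairs that contain the index, which accounts for $d + 3$.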

Remark 4.4 (Effective Dimension). Let us stress that the nominal dimension of the matrices does 
not play a role in Theorem 4.1. If the ranges of the matrices $A_1, A_2, \dots$ are contained within a fixed 
r-dimensional subspace, we can replace the ambient dimension d with the effective dimension r. A 
similar remark applies to our other results. 

4.6. Comparison with Concentration Inequalities. It is fruitful to think about Theorem 4.1 
as a statement that the matrix Gaussian series (4.8) typically falls near its expectation as a random 
matrix when we measure the size of deviations using the operator norm: 

$$\mathbb{P}\{\|Y - \mathbb{E}\, Y\| \ge t\} \le 2d \cdot e^{-t^2/2\sigma^2}. \tag{4.12}$$

In contrast, the classical concentration inequality [Bog98, Thm. 1.7.6] concerns the variation of the 
norm about its mean value: 

$$\mathbb{P}\big\{\big|\, \|Y\| - \mathbb{E}\, \|Y\| \,\big| \ge t\big\} \le 2 \cdot e^{-t^2/2\sigma_*^2}, \tag{4.13}$$

where the scale for deviations depends on the weak variance parameter 

$$\sigma_*^2 := \sup\Big\{\sum_k |u^* A_k v|^2 : \|u\| = \|v\| = 1\Big\}. \tag{4.14}$$

It can be shown [LT91, Cor. 3.2] that the bound (4.13) is asymptotically sharp as $t \to \infty$. 

Let us elaborate on the relationship between the matrix variance $\sigma^2$ defined in (4.2) and the 
weak variance $\sigma_*^2$ appearing in (4.14). First, note that 

$$\sigma_*^2 \le \sup_{\|u\| = 1} \sum_k u^* A_k^2 u = \Big\|\sum_k A_k^2\Big\| = \sigma^2. \tag{4.15}$$

Equality holds in (4.15) when, for example, the family $\{A_k\}$ commutes. We can also establish a 
reverse inequality: 

$$\sigma^2 = \Big\|\sum_k A_k \Big(\sum_j e_j e_j^*\Big) A_k\Big\| \le \sum_j \Big\|\sum_k A_k e_j e_j^* A_k\Big\| \le d \cdot \sigma_*^2, \tag{4.16}$$

where $\{e_j : j = 1, \dots, d\}$ is the standard basis for $\mathbb{R}^d$. In the worst case¹, the bound (4.16) has 
roughly the correct order. 

In summary, the matrix concentration inequality (4.12) always leads to a good estimate for 
the expected norm $\mathbb{E}\, \|Y\|$. Nevertheless, the presence of the parameter $\sigma^2$ in the tail bound can 
lead to a significant overestimate of the probability that $\|Y\|$ is large. On the other hand, the 
classical inequality (4.13) contains no information about the mean, but it always produces a sharp 
large-deviation bound. Therefore, the two results complement each other well. 

¹The worst-case example occurs with high probability when the sequence $\{A_k : k = 1, \dots, d\}$ consists of independent 
matrices drawn from the $d$-dimensional GOE, but the proof seems to be complicated. 
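
4.7. Noncommutative Moment Inequalities. The matrix Laplace transform bound, 
Proposition 3.1, demonstrates that we can bound tail probabilities for the norm of a random series by 
controlling the matrix mgf. In certain special cases, it is possible to bound the matrix mgf 
using noncommutative (nc) moment inequalities. Let us describe how to establish Theorem 4.1 in 
this fashion. This material is unrelated to the main development, so the reader may skip it with 
impunity. 

The nc Khintchine inequality provides an estimate for the expectation of the $(2p)$th moment 
of the Schatten $2p$-norm of a matrix Gaussian series [LP86, LPP91, Pis03]. The most elementary 
formulation of this result states that 

$$\mathbb{E} \operatorname{tr}\Big(\sum_k \gamma_k A_k\Big)^{2p} \le C_{2p} \cdot \operatorname{tr}\Big(\sum_k A_k^2\Big)^{p} \quad \text{for } p = 1, 2, 3, \dots. \tag{4.17}$$

Buchholz [Buc01, Thm. 5] has shown that the optimal constant in (4.17) satisfies 

$$C_{2p} := \mathbb{E}\, |\gamma_1|^{2p} = (2p - 1)!! = \frac{(2p)!}{p!\, 2^p}.$$

The bound (4.17) also holds with the same constant when we replace $\{\gamma_k\}$ by a sequence of 
independent Rademacher variables [Buc05, Thm. 5]. 

The family (4.17) of inequalities allows us to develop a short proof of the tail bound for matrix 
Gaussian and Rademacher series. 

Alternative Proof of Theorem 4.1. Proposition 3.1 yields 

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k \gamma_k A_k\Big) \ge t\Big\} \le \inf_{\theta > 0} \Big\{ e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} \exp\Big(\theta \sum_k \gamma_k A_k\Big) \Big\}. \tag{4.18}$$

We may use (4.17) to bound the Taylor series for the matrix mgf term by term: 

$$\begin{aligned}
\mathbb{E} \operatorname{tr} \exp\Big(\theta \sum_k \gamma_k A_k\Big) &= \sum_{p=0}^{\infty} \frac{\theta^{2p}}{(2p)!}\, \mathbb{E} \operatorname{tr}\Big(\sum_k \gamma_k A_k\Big)^{2p} \\
&\le \sum_{p=0}^{\infty} \frac{\theta^{2p}}{p!\, 2^p} \operatorname{tr}\Big(\sum_k A_k^2\Big)^{p} = \operatorname{tr} \exp\Big(\frac{\theta^2}{2} \sum_k A_k^2\Big) \le d \cdot e^{\theta^2 \sigma^2/2}.
\end{aligned} \tag{4.19}$$

Substitute (4.19) into (4.18), and select $\theta = t/\sigma^2$ to complete the minimization. □ 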

We may regard the mgf bound (4.19) as an "exponential generating function" for the family of 
nc Khintchine inequalities (4.17), but, unfortunately, the nc Khintchine inequalities do not follow 
as a consequence of this mgf bound. Recall that Lieb's result, Theorem 3.2, also delivers a proof 
of the inequality (4.19). This observation suggests that it might be possible to use Lieb's theorem 
to prove the nc Khintchine inequalities (4.17). We regard this as a tantalizing open question. 

4.8. Comparison with the Ahlswede-Winter Bound. In §3.7, we describe how Ahlswede 
and Winter go about bounding the matrix mgf [AW02, App.]. It is natural to ask how inequalities 
developed using their approach compare with the results in this paper. 

Gaussian series provide an excellent illustration of the discrepancy between the two techniques. 
In this case, the Ahlswede-Winter method yields the probability inequality 

$$\mathbb{P}\Big\{\Big\|\sum_k \gamma_k A_k\Big\| \ge t\Big\} \le 2d \cdot e^{-t^2/2\sigma_{\mathrm{AW}}^2} \quad \text{where } \sigma_{\mathrm{AW}}^2 := \sum_k \|A_k^2\|. \tag{4.20}$$

The estimate (4.20) should be compared with our bound (4.4). The Ahlswede-Winter variance 
parameter $\sigma_{\mathrm{AW}}^2$ always dominates the matrix variance parameter (4.2) because 

$$\sigma^2 = \Big\|\sum_k A_k^2\Big\| \le \sum_k \|A_k^2\| = \sigma_{\mathrm{AW}}^2.$$

The two variance parameters rarely coincide, and the best reverse inequality is 

$$\sigma_{\mathrm{AW}}^2 \le d \cdot \sigma^2.$$

This worst-case behavior is typical. For instance, consider the two Gaussian matrices presented 
in §4.5. The Ahlswede-Winter tail bound (4.20) provides essentially no information about the norm 
of either matrix. 
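The gap between the two variance parameters is visible already for the diagonal series of §4.5. The sketch below assumes that the Ahlswede-Winter parameter is the sum of the norms $\|A_k^2\|$, in line with the "sum of eigenvalues" description in §3.7, and compares it with the matrix variance parameter (4.2) for the coefficients $A_k = E_{kk}$. The function names are illustrative.

```python
import numpy as np

def matrix_variance(As):
    """sigma^2 = || sum_k A_k^2 ||: the norm of the sum."""
    return np.linalg.norm(sum(A @ A for A in As), 2)

def aw_variance(As):
    """Assumed form of the Ahlswede-Winter parameter: sum_k || A_k^2 ||."""
    return sum(np.linalg.norm(A @ A, 2) for A in As)

d = 6
As = [np.diag(np.eye(d)[k]) for k in range(d)]   # A_k = E_kk
sigma_sq = matrix_variance(As)
sigma_aw_sq = aw_variance(As)
```

In this example the sum of the squares is the identity, so $\sigma^2 = 1$, while each summand contributes $1$ to the Ahlswede-Winter parameter, giving the worst-case ratio $d$.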

Remark 4.5 (Moment Inequalities) . There is an alternative approach to establishing the result (4.20) 
that parallels the method presented in §4.7. We simply bound the Taylor series of the matrix mgf 
term by term using an appropriate family of moment inequalities: 

^^""''^ ^^"'-^^ for p= 1,2,3,.... 

These estimates follow from a result of Tomczak-Jaegermann [TJ74, Thm. 3.1] for Rademacher 
series together with the central limit theorem. 



5. Sums of Random Positive-Semidefinite Matrices 

The classical Chernoff bounds concern the sum of independent, nonnegative, and uniformly 
bounded random variables. In sympathy, matrix Chernoff bounds describe the extreme eigenvalues 
of a sum of independent, psd random matrices whose maximum eigenvalues are subject to a uniform 
bound. These probability inequalities demonstrate that the upper and lower tails of the sum exhibit 
binomial-type behavior. 

Our first result parallels the strongest versions of the scalar Chernoff inequality for the proportion 
of successes in a sequence of independent (but not identical) Bernoulli trials [Lug09, Exer. 7]. 

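Theorem 5.1 (Matrix Chernoff I). Consider a sequence $\{X_k : k = 1, 2, \dots, n\}$ of independent, 
random, self-adjoint matrices that satisfy 

$$X_k \succeq 0 \quad \text{and} \quad \lambda_{\max}(X_k) \le 1 \quad \text{almost surely.}$$

Compute the minimum and maximum eigenvalues of the average expectation, 

$$\bar{\mu}_{\min} := \lambda_{\min}\Big(\frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\, X_k\Big) \quad \text{and} \quad \bar{\mu}_{\max} := \lambda_{\max}\Big(\frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\, X_k\Big).$$

Then 

$$\mathbb{P}\Big\{\lambda_{\min}\Big(\frac{1}{n} \sum_{k=1}^{n} X_k\Big) \le \alpha\Big\} \le d \cdot e^{-n \cdot D(\alpha \,\|\, \bar{\mu}_{\min})} \quad \text{for } 0 \le \alpha \le \bar{\mu}_{\min}, \text{ and}$$

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\frac{1}{n} \sum_{k=1}^{n} X_k\Big) \ge \alpha\Big\} \le d \cdot e^{-n \cdot D(\alpha \,\|\, \bar{\mu}_{\max})} \quad \text{for } \bar{\mu}_{\max} \le \alpha \le 1.$$

The binary information divergence $D(a \,\|\, u) := a(\log(a) - \log(u)) + (1 - a)(\log(1 - a) - \log(1 - u))$ 
for $a, u \in [0, 1]$. 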
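The bounds of Theorem 5.1 are simple to evaluate. The following sketch implements the binary information divergence, with the convention $0 \log 0 = 0$ and the assumption $0 < u < 1$, together with the upper tail bound. The function names and the numerical values are illustrative assumptions.

```python
import math

def binary_divergence(a, u):
    """D(a || u) for a in [0, 1] and u in (0, 1), with 0 * log(0) = 0."""
    def term(p, q):
        return 0.0 if p == 0 else p * (math.log(p) - math.log(q))
    return term(a, u) + term(1 - a, 1 - u)

def chernoff_upper_bound(alpha, mu_bar_max, n, d):
    """d * exp(-n * D(alpha || mu_bar_max)), valid for mu_bar_max <= alpha <= 1."""
    return d * math.exp(-n * binary_divergence(alpha, mu_bar_max))

# Example: n = 100 summands in dimension d = 5, average expectation with
# maximum eigenvalue 0.1, and deviation level alpha = 0.2.
bound = chernoff_upper_bound(0.2, 0.1, 100, 5)
```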

We have found that the following weaker version of Theorem 5.1 produces excellent results but 
is simpler to apply. This corollary corresponds with the usual statement of the scalar Chernoff 
inequalities for sums of nonnegative random variables; see [Lug09, Exer. 8] or [MR95, §4.1]. 

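Corollary 5.2 (Matrix Chernoff II). Consider a finite sequence $\{X_k\}$ of independent, random, 
self-adjoint matrices that satisfy 

$$X_k \succeq 0 \quad \text{and} \quad \lambda_{\max}(X_k) \le R \quad \text{almost surely.}$$

Compute the minimum and maximum eigenvalues of the sum of expectations, 

$$\mu_{\min} := \lambda_{\min}\Big(\sum_k \mathbb{E}\, X_k\Big) \quad \text{and} \quad \mu_{\max} := \lambda_{\max}\Big(\sum_k \mathbb{E}\, X_k\Big).$$

Then 

$$\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_k X_k\Big) \le (1 - \delta)\mu_{\min}\Big\} \le d \cdot \left[\frac{e^{-\delta}}{(1 - \delta)^{1 - \delta}}\right]^{\mu_{\min}/R} \quad \text{for } \delta \in [0, 1], \text{ and}$$

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k X_k\Big) \ge (1 + \delta)\mu_{\max}\Big\} \le d \cdot \left[\frac{e^{\delta}}{(1 + \delta)^{1 + \delta}}\right]^{\mu_{\max}/R} \quad \text{for } \delta \ge 0.$$
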



The proofs of Theorem 5.1 and Corollary 5.2 appear below in Section 5.1. We continue this 
discussion with some telegraphic remarks concerning various aspects of the Chernoff bounds. 

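Remark 5.3 (Related Inequalities). The following standard simplification of Corollary 5.2 is useful. 

$$\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_k X_k\Big) \le t\mu_{\min}\Big\} \le d \cdot e^{-(1 - t)^2 \mu_{\min}/2R} \quad \text{for } t \in [0, 1], \text{ and}$$

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k X_k\Big) \ge t\mu_{\max}\Big\} \le d \cdot \left[\frac{e}{t}\right]^{t\mu_{\max}/R} \quad \text{for } t \ge e.$$
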

These inequalities manifest that the minimum eigenvalue has normal-type behavior and the maxi- 
mum eigenvalue exhibits Poisson-type decay. 

Remark 5.4 (Applications). Matrix Chernoff inequalities are very effective for studying random 
matrices with independent columns. Consider a rectangular random matrix 

$$Z = [\, z_1 \ \ z_2 \ \ \dots \ \ z_n \,]$$

where $\{z_k\}$ is a family of independent random vectors in $\mathbb{C}^m$. The norm of $Z$ satisfies 

$$\|Z\|^2 = \lambda_{\max}(ZZ^*) = \lambda_{\max}\Big(\sum_{k=1}^{n} z_k z_k^*\Big).$$

Similarly, the minimum singular value $s_m$ of the matrix satisfies 

$$s_m(Z)^2 = \lambda_{\min}(ZZ^*) = \lambda_{\min}\Big(\sum_{k=1}^{n} z_k z_k^*\Big).$$

In each case, the summands are stochastically independent and psd, so the matrix Chernoff bounds 
apply. See [Tro10] for a problem where this method applies. 

Remark 5.5 (Expectations). Corollary 5.2 produces accurate estimates for the expectation of the 
maximum eigenvalue: 

$$\mu_{\max} \le \mathbb{E}\, \lambda_{\max}\Big(\sum_k X_k\Big) \le C \cdot \max\{\mu_{\max}, R \log d\}.$$

The lower bound is Jensen's inequality; the upper bound follows from a messy but standard 
calculation. Observe that the dimensional dependence vanishes when the mean $\mu_{\max}$ is sufficiently 
large in comparison with the upper bound $R$. 

Remark 5.6 (Dimensional Factor). The factor $d$ in the Chernoff bounds cannot be omitted because 
of the coupon collector's problem [MR95, §3.6]. Consider a $d$-dimensional random matrix $X$ with 
the distribution 

$$X = E_{jj} \quad \text{with probability } d^{-1} \text{ for each } j = 1, 2, \dots, d.$$
If $\{X_k\}$ is a sequence of independent random matrices with the same distribution as $X$, then 

$$\lambda_{\min}\Big(\sum_{k=1}^{n} X_k\Big) = 0 \quad \text{with high probability unless } n > d \log d.$$

The dimensional factor in the lower Chernoff bound reflects this fact. The same example shows 
that the upper Chernoff bound must also exhibit a dimensional dependence. We have extracted 
this idea from [RV07, Sec. 3.5]. 
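The coupon collector example is easy to simulate, because the sum of the matrices $X_k$ is diagonal and its $j$th diagonal entry counts how often $E_{jj}$ was drawn. The sketch below contrasts a small sample size $n = d$ with a sample size well beyond $d \log d$. The function name, the seed, and the factor 20 are illustrative assumptions.

```python
import numpy as np

def min_eigenvalue_of_sum(d, n, rng):
    """lambda_min of a sum of n independent copies of X, where X = E_jj with j uniform."""
    draws = rng.integers(0, d, size=n)
    counts = np.bincount(draws, minlength=d)   # the sum equals diag(counts)
    return counts.min()

rng = np.random.default_rng(5)
d = 50
few = min_eigenvalue_of_sum(d, d, rng)                          # n = d
many = min_eigenvalue_of_sum(d, int(20 * d * np.log(d)), rng)   # n far above d log d
```

With $n = d$ draws, at least one index is missed except with negligible probability, so the minimum eigenvalue is zero; with the larger sample every index appears many times.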

Remark 5.7 (Previous Work). Theorem 5.1 is a considerable strengthening of the matrix Chernoff 
bound established by Ahlswede and Winter [AW02, Thm. 19]. Their proof requires the extra 
assumption that the summands are identically distributed, in which case their result matches 
Theorem 5.1. 

5.1. Proofs. To establish the matrix Chernoff inequalities, we commence with a semidefinite bound 
for the matrix mgf of a random psd contraction. 

Lemma 5.8 (Chernoff mgf). Suppose that $X$ is a random psd matrix that satisfies $\lambda_{\max}(X) \le 1$. 
Then 

$$\mathbb{E}\, e^{\theta X} \preceq I + (e^{\theta} - 1)(\mathbb{E}\, X) \quad \text{for } \theta \in \mathbb{R}.$$

The proof of Lemma 5.8 parallels the classical argument; the matrix adaptation is due to 
Ahlswede and Winter [AW02, Thm. 19]. 

Proof. Consider the function $f(x) = e^{\theta x}$. Since $f$ is convex, its graph lies below the chord connecting 
two points. In particular, 

$$f(x) \le f(0) + [f(1) - f(0)] \cdot x \quad \text{for } x \in [0, 1].$$

More explicitly, 

$$e^{\theta x} \le 1 + (e^{\theta} - 1) \cdot x \quad \text{for } x \in [0, 1].$$
The eigenvalues of $X$ lie in the interval $[0, 1]$, so the transfer rule (2.2) implies that 

$$e^{\theta X} \preceq I + (e^{\theta} - 1) X.$$
Expectation respects the semidefinite order, so 

$$\mathbb{E}\, e^{\theta X} \preceq I + (e^{\theta} - 1)(\mathbb{E}\, X).$$
This is the advertised conclusion. □ 
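The chord argument can be verified numerically for a fixed psd contraction. The sketch below forms the difference $I + (e^{\theta} - 1)X - e^{\theta X}$ and checks that its smallest eigenvalue is nonnegative up to rounding, for several values of $\theta$. The helper names, the seed, and the test matrix are illustrative assumptions.

```python
import numpy as np

def sym_fun(A, f):
    """Apply the scalar function f to the eigenvalues of a Hermitian matrix A."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def chord_gap(X, theta):
    """Smallest eigenvalue of I + (e^theta - 1) X - exp(theta X)."""
    d = X.shape[0]
    upper = np.eye(d) + (np.exp(theta) - 1.0) * X
    lower = sym_fun(X, lambda w: np.exp(theta * w))
    return np.linalg.eigvalsh(upper - lower).min()

rng = np.random.default_rng(6)
G = rng.standard_normal((4, 4))
P = G @ G.T                       # psd by construction
X = P / np.linalg.norm(P, 2)      # rescaled so that lambda_max(X) = 1
gaps = [chord_gap(X, theta) for theta in (-2.0, 0.5, 1.7)]
```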

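We prove the upper Chernoff bounds first because the argument is slightly easier. 

Proof of Theorem 5.1, Upper Bound. The Chernoff mgf bound, Lemma 5.8, states that 

$$\mathbb{E}\, e^{\theta X_k} \preceq I + g(\theta) \cdot (\mathbb{E}\, X_k) \quad \text{where } g(\theta) := e^{\theta} - 1 \text{ for } \theta > 0.$$
As a result, Corollary 3.9 implies 

$$\begin{aligned}
\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k X_k\Big) \ge t\Big\} &\le d \cdot \exp\Big(-\theta t + n \cdot \log \lambda_{\max}\Big(\frac{1}{n} \sum_k \big(I + g(\theta) \cdot \mathbb{E}\, X_k\big)\Big)\Big) \\
&= d \cdot \exp\Big(-\theta t + n \cdot \log \lambda_{\max}\Big(I + g(\theta) \cdot \frac{1}{n} \sum_k \mathbb{E}\, X_k\Big)\Big) \\
&= d \cdot \exp\big(-\theta t + n \cdot \log(1 + g(\theta) \cdot \bar{\mu}_{\max})\big).
\end{aligned} \tag{5.1}$$

The third relation follows from basic properties of the eigenvalue map and the definition of $\bar{\mu}_{\max}$. 
Make the change of variables $t \mapsto n\alpha$. The right-hand side is smallest when 

$$\theta = \log(\alpha/(1 - \alpha)) - \log(\bar{\mu}_{\max}/(1 - \bar{\mu}_{\max})).$$

Substitute these quantities into (5.1) to obtain the information divergence upper bound. □ 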
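The final substitution can be confirmed numerically. The sketch below evaluates the exponent of (5.1) after the change of variables $t = n\alpha$, plugs in the stated minimizer, and compares the result with $-n \cdot D(\alpha \,\|\, \bar{\mu}_{\max})$. The function names and the values of $\alpha$, $\bar{\mu}_{\max}$, and $n$ are illustrative assumptions.

```python
import math

def exponent(theta, alpha, mu, n):
    """Exponent of (5.1) with t = n * alpha: -theta*n*alpha + n*log(1 + (e^theta - 1)*mu)."""
    return -theta * n * alpha + n * math.log(1 + (math.exp(theta) - 1) * mu)

def binary_divergence(a, u):
    """D(a || u) for a, u strictly between 0 and 1."""
    return a * (math.log(a) - math.log(u)) + (1 - a) * (math.log(1 - a) - math.log(1 - u))

alpha, mu, n = 0.4, 0.25, 50
theta_star = math.log(alpha / (1 - alpha)) - math.log(mu / (1 - mu))
optimal = exponent(theta_star, alpha, mu, n)
```

Because the exponent is a convex function of $\theta$, nearby values of $\theta$ cannot give a smaller exponent than the stationary point.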

Proof of Corollary 5.2, Upper Bound. Assume that the summands satisfy the uniform eigenvalue 
bound with $R = 1$; the general result follows by re-scaling. The shortest route to the weaker 
Chernoff upper bound starts at (5.1). The numerical inequality $\log(1 + x) \le x$, valid for $x > -1$, 
implies that 

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k X_k\Big) \ge t\Big\} \le d \cdot \exp\big(-\theta t + g(\theta) \cdot n\bar{\mu}_{\max}\big) = d \cdot \exp\big(-\theta t + g(\theta) \cdot \mu_{\max}\big).$$

Make the change of variables $t \mapsto (1 + \delta)\mu_{\max}$, and select the parameter $\theta = \log(1 + \delta)$. Simplify 
the resulting tail bound to complete the proof. □ 

The lower bounds follow from a closely related argument. 

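Proof of Theorem 5.1, Lower Bound. We intend to apply Corollary 3.9 to the sequence $\{-X_k\}$. In 
this case, the Chernoff mgf, Lemma 5.8, states that 

$$\mathbb{E}\, e^{\theta(-X_k)} = \mathbb{E}\, e^{(-\theta) X_k} \preceq I - g(\theta) \cdot (\mathbb{E}\, X_k) \quad \text{where } g(\theta) := 1 - e^{-\theta} \text{ for } \theta > 0.$$
The minimum eigenvalue satisfies $\lambda_{\min}(-A) = -\lambda_{\max}(A)$, so we can apply Corollary 3.9 as follows. 

$$\begin{aligned}
\mathbb{P}\Big\{\lambda_{\min}\Big(\sum_k X_k\Big) \le t\Big\} &= \mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k (-X_k)\Big) \ge -t\Big\} \\
&\le d \cdot \exp\Big(\theta t + n \cdot \log \lambda_{\max}\Big(\frac{1}{n} \sum_k \big(I - g(\theta) \cdot \mathbb{E}\, X_k\big)\Big)\Big) \\
&= d \cdot \exp\Big(\theta t + n \cdot \log\Big(1 - g(\theta) \cdot \lambda_{\min}\Big(\frac{1}{n} \sum_k \mathbb{E}\, X_k\Big)\Big)\Big) \\
&= d \cdot \exp\big(\theta t + n \cdot \log(1 - g(\theta) \cdot \bar{\mu}_{\min})\big).
\end{aligned} \tag{5.2}$$

Make the substitution $t \mapsto n\alpha$. The right-hand side is minimal when 

$$\theta = \log(\bar{\mu}_{\min}/(1 - \bar{\mu}_{\min})) - \log(\alpha/(1 - \alpha)).$$

These steps result in the information divergence lower bound. □ 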

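Proof of Corollary 5.2, Lower Bound. As before, assume that the uniform bound holds with $R = 1$. We obtain
the weaker lower bound as a consequence of (5.2). The inequality $\log(1 + x) \le x$ holds for $x > -1$,
so we have

\[ \mathbb{P}\left\{ \lambda_{\min}\left( \sum\nolimits_k \mathbf{X}_k \right) \le t \right\} \le d \cdot \exp\left( \theta t - g(\theta) \cdot n \bar{\mu}_{\min} \right) = d \cdot \exp\left( \theta t - g(\theta) \cdot \mu_{\min} \right). \]

Make the replacement $t \mapsto (1 - \delta)\mu_{\min}$ and select $\theta = -\log(1 - \delta)$ to complete the proof. □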

Remark 5.9 (Alternative Proof). Corollary 5.2 can also be established directly using Corollary 3.7 
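instead of Corollary 3.9. In this case, we use the mgf bound

\[ \mathbb{E}\, e^{\theta \mathbf{X}} \preccurlyeq \exp\left( (e^{\theta} - 1)(\mathbb{E}\, \mathbf{X}) \right) \quad \text{for } \theta \in \mathbb{R}, \]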

which follows instantly from Lemma 5.8 and the semidefinite relation (2.3). The remaining details 
mirror the arguments here. 

6. Matrix Bennett and Bernstein Inequalities 

In the scalar setting, Bennett and Bernstein inequalities describe the upper tail of a sum of 
independent, zero-mean random variables that are either bounded or subexponential. In the matrix 
case, the analogous results concern a sum of zero-mean random matrices. 

Our first result describes the case where the maximum eigenvalue of each summand satisfies a 
uniform bound. 

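Theorem 6.1 (Matrix Bernstein: Bounded Case). Consider a finite sequence $\{\mathbf{X}_k\}$ of independent,
random, self-adjoint matrices with dimension $d$. Assume that

\[ \mathbb{E}\, \mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \lambda_{\max}(\mathbf{X}_k) \le R \quad \text{almost surely.} \]

Compute the norm of the total variance,

\[ \sigma^2 := \left\| \sum\nolimits_k \mathbb{E}(\mathbf{X}_k^2) \right\|. \]

Then the following chain of inequalities holds for all $t \ge 0$.

\begin{align*}
\mathbb{P}\left\{ \lambda_{\max}\left( \sum\nolimits_k \mathbf{X}_k \right) \ge t \right\}
  &\le d \cdot \exp\left( -\frac{\sigma^2}{R^2} \cdot h\left( \frac{Rt}{\sigma^2} \right) \right) \tag{i} \\
  &\le d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right) \tag{ii} \\
  &\le \begin{cases}
         d \cdot \exp(-3t^2/8\sigma^2) & \text{for } t \le \sigma^2/R; \\
         d \cdot \exp(-3t/8R) & \text{for } t \ge \sigma^2/R.
       \end{cases} \tag{iii}
\end{align*}

The function $h(u) := (1 + u) \log(1 + u) - u$ for $u \ge 0$.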

Observe that Theorem 6.1 places no assumption on the minimum eigenvalues of the summands, 
which may be arbitrarily small. As a consequence, when we apply the result to the two sequences 
$\{\mathbf{X}_k\}$ and $\{-\mathbf{X}_k\}$, the parameter $R$ may differ.

Theorem 6.1(i) can be viewed as a matrix version of the Bennett inequality [Lug09, Thm. 5], 
which implies that the tail probabilities exhibit Poisson-type decay. Part (ii) parallels a well-known 
result [Lug09, Thm. 6], which is perhaps the most famous among the probability inequalities 
attributed to Bernstein. Part (iii), which we call the split Bernstein inequality, clearly delineates 
between the normal behavior that occurs at moderate deviations and the slower decay that emerges 
in the tail. 
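
To illustrate how the three bounds compare, the following Python sketch evaluates them for scalar inputs. It assumes the forms $d \cdot \exp(-(\sigma^2/R^2) \cdot h(Rt/\sigma^2))$ for Part (i), $d \cdot \exp(-(t^2/2)/(\sigma^2 + Rt/3))$ for Part (ii), and the two-piece bound for Part (iii). The function names and the numerical values of $d$, $\sigma^2$, and $R$ are hypothetical choices made only for illustration.

```python
import math

def bennett_h(u):
    """h(u) = (1 + u) log(1 + u) - u for u >= 0."""
    return (1.0 + u) * math.log1p(u) - u

def bennett_bound(t, d, sigma2, R):
    """Part (i): d * exp(-(sigma2 / R^2) * h(R t / sigma2))."""
    return d * math.exp(-(sigma2 / R**2) * bennett_h(R * t / sigma2))

def bernstein_bound(t, d, sigma2, R):
    """Part (ii): d * exp(-(t^2 / 2) / (sigma2 + R t / 3))."""
    return d * math.exp(-(t * t / 2.0) / (sigma2 + R * t / 3.0))

def split_bernstein_bound(t, d, sigma2, R):
    """Part (iii): subgaussian piece for t <= sigma2 / R, subexponential piece beyond."""
    if t <= sigma2 / R:
        return d * math.exp(-3.0 * t * t / (8.0 * sigma2))
    return d * math.exp(-3.0 * t / (8.0 * R))

# Hypothetical parameters: dimension 100, variance parameter 4, eigenvalue bound 1.
d, sigma2, R = 100, 4.0, 1.0
for t in (0.5, 2.0, 4.0, 10.0, 40.0):
    b1 = bennett_bound(t, d, sigma2, R)
    b2 = bernstein_bound(t, d, sigma2, R)
    b3 = split_bernstein_bound(t, d, sigma2, R)
    print(f"t={t:5.1f}  (i)={b1:.3e}  (ii)={b2:.3e}  (iii)={b3:.3e}")
```

For small $t$ the computed values exceed one, so the bounds are informative only once $t$ is large enough to overcome the dimensional factor $d$.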

A related inequality holds when we allow the moments of the random matrices to grow at a 
limited rate, which we interpret as a matrix extension of the moment behavior of a subexponential 
random variable [dlPG02, Lem. 4.1.9]. 

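Theorem 6.2 (Matrix Bernstein: Subexponential Case). Consider a finite sequence $\{\mathbf{X}_k\}$ of
independent, random, self-adjoint matrices with dimension $d$. Assume that

\[ \mathbb{E}\, \mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \mathbb{E}(\mathbf{X}_k^p) \preccurlyeq \frac{p!}{2} \cdot R^{p-2} \mathbf{A}_k^2 \quad \text{for } p = 2, 3, 4, \dots. \]

Compute the variance parameter

\[ \sigma^2 := \left\| \sum\nolimits_k \mathbf{A}_k^2 \right\|. \]

Then the following chain of inequalities holds for all $t \ge 0$.

\begin{align*}
\mathbb{P}\left\{ \lambda_{\max}\left( \sum\nolimits_k \mathbf{X}_k \right) \ge t \right\}
  &\le d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt} \right) \tag{i} \\
  &\le \begin{cases}
         d \cdot \exp(-t^2/4\sigma^2) & \text{for } t \le \sigma^2/R; \\
         d \cdot \exp(-t/4R) & \text{for } t \ge \sigma^2/R.
       \end{cases} \tag{ii}
\end{align*}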



The hypotheses of Theorem 6.2 are not fully comparable with the hypotheses of Theorem 6.1 
because Theorem 6.2 allows the random matrices to be unbounded but it also demands that we 
control the fluctuation of the maximum and minimum eigenvalues. The resulting tail bound is 
very similar to Theorem 6.1(ii). We cannot achieve a Bennett-type inequality, like Theorem 6.1(i),
without stricter assumptions on the growth of moments. 

The proofs of Theorems 6.1 and 6.2 appear below. We finish the discussion with an assorted
collection of enriching comments. 

Remark 6.3 (Rectangular Versions). The matrix Bernstein inequalities admit rectangular variants. 
For example, consider a sequence $\{\mathbf{Z}_k\}$ of $d_1 \times d_2$ random matrices that satisfy the assumptions

\[ \mathbb{E}\, \mathbf{Z}_k = \mathbf{0} \quad \text{and} \quad \|\mathbf{Z}_k\| \le R \quad \text{almost surely.} \]

We can apply Theorem 6.1 to the s.a. dilation (2.10) of the sum of these random matrices to see
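that the probability satisfies

\[ \mathbb{P}\left\{ \left\| \sum\nolimits_k \mathbf{Z}_k \right\| \ge t \right\} \le d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right), \]

where $d := d_1 + d_2$ and where the variance parameter

\[ \sigma^2 := \max\left\{ \left\| \sum\nolimits_k \mathbb{E}(\mathbf{Z}_k \mathbf{Z}_k^*) \right\|, \ \left\| \sum\nolimits_k \mathbb{E}(\mathbf{Z}_k^* \mathbf{Z}_k) \right\| \right\}. \]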

This argument leads to Theorem 1.6, stated in the introduction. There is also a rectangular 
extension of Theorem 6.2, but the hypotheses are messier. 

Remark 6.4 (Related Inequalities). There are too many variants of the scalar Bernstein inequality 
to present the matrix generalization of each one. Let us just mention a few of the possibilities. 

• Theorem 6.2 can be sharpened using an idea of Rio that appears in [Mas07, Sec. 2.2.3]. 

• When the random matrices exhibit moment growth of the form $\mathbb{E}(\mathbf{X}_k^p) \preccurlyeq R^{p-2} \mathbf{A}_k^2$, we
recover the Poissonian tail behavior captured in Theorem 6.1(i).

• When the summands are symmetric random variables (i.e., $\mathbf{X}_k \sim -\mathbf{X}_k$), we can exploit
the fact that the matrix mgf $\mathbb{E}\, e^{\theta \mathbf{X}_k} = \mathbb{E} \cosh(\theta \mathbf{X}_k)$ to obtain arcsinh inequalities.

Remark 6.5 (Expectations). We can use the matrix Bernstein inequality to bound the mean of the 
maximum eigenvalue of the random sum. For example, assume that the hypotheses of Theorem 6.1 
or 6.2 are in force. Then

\[ \mathbb{E}\, \lambda_{\max}\left( \sum\nolimits_k \mathbf{X}_k \right) \le C \cdot \max\left\{ \sigma \sqrt{\log d}, \ R \log d \right\}. \tag{6.1} \]



The upper bound follows by integrating Theorem 6.1(ii) or Theorem 6.2(i). Lower bounds seem to 
require additional assumptions. 

Remark 6.6 (Previous Work). Oliveira's results are quite similar to the bounds presented here. In 
particular, Oliveira's martingale inequality [Oli10a, Thm. 1.2] implies a weaker version of
Theorem 6.1(ii). The main result from [Oli10b] has a similar flavor.

6.1. Proof of Theorem 6.1. The main lemma shows how to bound the mgf of a zero-mean 
random matrix using a bound for its largest eigenvalue. 

Lemma 6.7 (Bounded Bernstein mgf). Suppose that X is a random s.a. matrix that satisfies 

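\[ \mathbb{E}\, \mathbf{X} = \mathbf{0} \quad \text{and} \quad \lambda_{\max}(\mathbf{X}) \le 1. \]

Then

\[ \mathbb{E}\, e^{\theta \mathbf{X}} \preccurlyeq \exp\left( (e^{\theta} - \theta - 1) \cdot \mathbb{E}(\mathbf{X}^2) \right) \quad \text{for } \theta > 0. \]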



As usual, the proof of the mgf bound parallels a classical method, which we learned from
correspondence with Yao-Liang Yu.



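Proof. Fix the parameter $\theta > 0$, and define a smooth function $f$ on the real line:

\[ f(x) = \frac{e^{\theta x} - \theta x - 1}{x^2} \quad \text{for } x \ne 0 \qquad \text{and} \qquad f(0) = \frac{\theta^2}{2}. \]

An exercise in differential calculus verifies that $f$ is increasing. Therefore, $f(x) \le f(1)$ when $x \le 1$.
The eigenvalues of $\mathbf{X}$ do not exceed one, so the transfer rule (2.2) implies that

\[ f(\mathbf{X}) \preccurlyeq f(1) \cdot \mathbf{I}. \]

Expanding the matrix exponential and applying the latter relation, we discover that

\[ e^{\theta \mathbf{X}} = \mathbf{I} + \theta \mathbf{X} + \mathbf{X} \cdot f(\mathbf{X}) \cdot \mathbf{X} \preccurlyeq \mathbf{I} + \theta \mathbf{X} + f(1) \cdot \mathbf{X}^2. \]

To complete the proof, we take the expectation of this semidefinite bound.

\[ \mathbb{E}\, e^{\theta \mathbf{X}} \preccurlyeq \mathbf{I} + f(1) \cdot \mathbb{E}(\mathbf{X}^2) \preccurlyeq \exp\left( f(1) \cdot \mathbb{E}(\mathbf{X}^2) \right) = \exp\left( (e^{\theta} - \theta - 1) \cdot \mathbb{E}(\mathbf{X}^2) \right). \]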

The second semidefinite relation follows from (2.3). □ 
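
The monotonicity of $f$ is easy to probe numerically. The Python sketch below is only a sanity check, not a proof: it evaluates $f$ on a grid covering $[-5, 1]$ and tests that consecutive values increase. The grid, the value $\theta = 1.5$, and the function name are assumptions made for illustration.

```python
import math

def f(x, theta):
    """f(x) = (exp(theta x) - theta x - 1) / x^2, extended by f(0) = theta^2 / 2."""
    if abs(x) < 1e-4:
        # Leading Taylor terms avoid catastrophic cancellation near zero.
        return theta**2 / 2.0 + theta**3 * x / 6.0
    return (math.expm1(theta * x) - theta * x) / (x * x)

theta = 1.5  # hypothetical value of the parameter
xs = [i / 100.0 for i in range(-500, 101)]  # grid on [-5, 1] with step 0.01
values = [f(x, theta) for x in xs]

# True when every grid value is strictly smaller than its successor.
is_increasing = all(a < b for a, b in zip(values, values[1:]))

# f(1) should coincide with e^theta - theta - 1, the constant in Lemma 6.7.
print(is_increasing, f(1.0, theta), math.exp(theta) - theta - 1.0)
```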

We are prepared to establish the Bernstein inequalities for bounded random matrices. 

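Proof of Theorem 6.1. We assume that $R = 1$; the general result follows by a scaling argument
once we note that the summands are 1-homogeneous and the variance is 2-homogeneous.

The main challenge is to establish the Bennett inequality, Part (i); the remaining bounds are
consequences of simple numerical estimates. Invoke Lemma 6.7 to see that

\[ \mathbb{E}\, e^{\theta \mathbf{X}_k} \preccurlyeq \exp\left( g(\theta) \cdot \mathbb{E}(\mathbf{X}_k^2) \right) \quad \text{where } g(\theta) := e^{\theta} - \theta - 1 \text{ for } \theta > 0. \]

For each $\theta > 0$, Corollary 3.7 implies that

\begin{align*}
\mathbb{P}\left\{ \lambda_{\max}\left( \sum\nolimits_k \mathbf{X}_k \right) \ge t \right\}
  &\le d \cdot \exp\left( -\theta t + g(\theta) \cdot \lambda_{\max}\left( \sum\nolimits_k \mathbb{E}(\mathbf{X}_k^2) \right) \right) \\
  &= d \cdot \exp\left( -\theta t + g(\theta) \cdot \sigma^2 \right).
\end{align*}

The right-hand side attains its minimal value when $\theta = \log(1 + t/\sigma^2)$. Substitute and simplify to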
establish Part (i). 

The Bennett inequality (i) implies the Bernstein inequality (ii) because of the numerical bound 

\[ h(u) \ge \frac{u^2/2}{1 + u/3} \quad \text{for } u \ge 0. \]

The latter relation is established by comparing derivatives. 

The Bernstein inequality (ii) implies the split Bernstein inequality (iii). To obtain the subgaussian 
piece of (iii), observe that

\[ \frac{1}{\sigma^2 + Rt/3} \ge \frac{1}{\sigma^2 + R(\sigma^2/R)/3} = \frac{3}{4\sigma^2} \quad \text{for } t \le \sigma^2/R \]

because the left-hand side is a decreasing function of $t$ for $t \ge 0$. Similarly, we obtain the
subexponential piece of (iii) from the fact

\[ \frac{t}{\sigma^2 + Rt/3} \ge \frac{\sigma^2/R}{\sigma^2 + R(\sigma^2/R)/3} = \frac{3}{4R} \quad \text{for } t \ge \sigma^2/R, \]

which holds because the left-hand side is an increasing function of $t$ for $t \ge 0$. □
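
For the reader's convenience, here is a sketch of the substitution that yields Part (i) in the normalized case $R = 1$. With $\theta = \log(1 + t/\sigma^2)$, we have $e^{\theta} - 1 = t/\sigma^2$, so the exponent $-\theta t + (e^{\theta} - \theta - 1) \cdot \sigma^2$ becomes

\begin{align*}
-t \log\left( 1 + \frac{t}{\sigma^2} \right) + t - \sigma^2 \log\left( 1 + \frac{t}{\sigma^2} \right)
  &= -\sigma^2 \left[ \left( 1 + \frac{t}{\sigma^2} \right) \log\left( 1 + \frac{t}{\sigma^2} \right) - \frac{t}{\sigma^2} \right] \\
  &= -\sigma^2 \cdot h\left( \frac{t}{\sigma^2} \right).
\end{align*}

Replacing $t$ by $t/R$ and $\sigma^2$ by $\sigma^2/R^2$ then restores the dependence on a general bound $R$.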

6.2. Proof of Theorem 6.2. We begin with the appropriate estimate for the matrix mgf. 

Lemma 6.8 (Subexponential Bernstein mgf). Suppose that X is a random s.a. matrix that satisfies 

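\[ \mathbb{E}\, \mathbf{X} = \mathbf{0} \quad \text{and} \quad \mathbb{E}(\mathbf{X}^p) \preccurlyeq \frac{p!}{2} \cdot \mathbf{A}^2 \quad \text{for } p = 2, 3, 4, \dots. \]

Then

\[ \mathbb{E}\, e^{\theta \mathbf{X}} \preccurlyeq \exp\left( \frac{\theta^2}{2(1 - \theta)} \cdot \mathbf{A}^2 \right) \quad \text{for } 0 < \theta < 1. \]

Proof. The argument proceeds by estimating each term in the Taylor series of the matrix
exponential. Indeed,

\[ \mathbb{E}\, e^{\theta \mathbf{X}} = \mathbf{I} + \theta\, \mathbb{E}\, \mathbf{X} + \sum_{p=2}^{\infty} \frac{\theta^p\, \mathbb{E}(\mathbf{X}^p)}{p!} \preccurlyeq \mathbf{I} + \sum_{p=2}^{\infty} \frac{\theta^p}{2} \cdot \mathbf{A}^2 = \mathbf{I} + \frac{\theta^2}{2(1 - \theta)} \cdot \mathbf{A}^2 \preccurlyeq \exp\left( \frac{\theta^2}{2(1 - \theta)} \cdot \mathbf{A}^2 \right). \]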

As usual, the last relation is (2.3). □

The Bernstein inequality for subexponential random matrices is an easy consequence of the
previous lemma. 

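Proof of Theorem 6.2. As before, we assume that $R = 1$; the general result follows by scaling.
Invoke Lemma 6.8 to see that

\[ \mathbb{E}\, e^{\theta \mathbf{X}_k} \preccurlyeq \exp\left( g(\theta) \cdot \mathbf{A}_k^2 \right) \quad \text{where } g(\theta) := \frac{\theta^2}{2(1 - \theta)} \text{ for } 0 < \theta < 1. \]

For each $\theta \in (0, 1)$, Corollary 3.7 implies that

\[ \mathbb{P}\left\{ \lambda_{\max}\left( \sum\nolimits_k \mathbf{X}_k \right) \ge t \right\} \le d \cdot \exp\left( -\theta t + g(\theta) \cdot \lambda_{\max}\left( \sum\nolimits_k \mathbf{A}_k^2 \right) \right) = d \cdot \exp\left( -\theta t + g(\theta) \cdot \sigma^2 \right). \]

We select $\theta = t/(\sigma^2 + t)$. Substitute and simplify to complete Part (i).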

The split inequality (ii) follows from Part (i) by the same argument presented in the proof of 
Theorem 6.1. □ 
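
The substitution in Part (i) can be sketched as follows, again in the case $R = 1$. The choice $\theta = t/(\sigma^2 + t)$ lies in $(0, 1)$ and gives $1 - \theta = \sigma^2/(\sigma^2 + t)$, so

\begin{align*}
-\theta t + \frac{\theta^2}{2(1 - \theta)} \cdot \sigma^2
  &= -\frac{t^2}{\sigma^2 + t} + \frac{t^2}{(\sigma^2 + t)^2} \cdot \frac{\sigma^2 + t}{2\sigma^2} \cdot \sigma^2 \\
  &= -\frac{t^2}{\sigma^2 + t} + \frac{t^2}{2(\sigma^2 + t)}
   = \frac{-t^2/2}{\sigma^2 + t}.
\end{align*}

Note that this value of $\theta$ is a convenient choice rather than the exact minimizer of the exponent.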

7. The Matrix Hoeffding, Azuma, and McDiarmid Inequalities 

In this section, we prove some simple martingale deviation bounds by modifying the approach 
that we have used to study sums of independent random matrices. More sophisticated martingale 
results require additional machinery [Oli10a, Tro11a].

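7.1. Matrix Martingales. We begin with the required definitions. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a master
probability space. Consider a filtration $\{\mathcal{F}_k\}$ contained in the master sigma algebra:

\[ \mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots \subset \mathcal{F}_{\infty} \subset \mathcal{F}. \]

Given such a filtration, we define the conditional expectation $\mathbb{E}_k[\,\cdot\,] := \mathbb{E}[\,\cdot \mid \mathcal{F}_k]$. A sequence $\{\mathbf{X}_k\}$
of random matrices is adapted to the filtration when each $\mathbf{X}_k$ is measurable with respect to $\mathcal{F}_k$.
Loosely speaking, an adapted sequence is one where the present depends only upon the past.
An adapted sequence $\{\mathbf{Y}_k\}$ of s.a. matrices is called a matrix martingale when

\[ \mathbb{E}_{k-1} \mathbf{Y}_k = \mathbf{Y}_{k-1} \quad \text{and} \quad \mathbb{E}\, \|\mathbf{Y}_k\| < \infty \quad \text{for } k = 1, 2, 3, \dots. \]

We obtain a scalar martingale if we track any fixed coordinate of a matrix martingale $\{\mathbf{Y}_k\}$. Given
a matrix martingale $\{\mathbf{Y}_k\}$, we can construct the difference sequence

\[ \mathbf{X}_k := \mathbf{Y}_k - \mathbf{Y}_{k-1} \quad \text{for } k = 1, 2, 3, \dots. \]

Note that the difference sequence is conditionally zero mean: $\mathbb{E}_{k-1} \mathbf{X}_k = \mathbf{0}$.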

7.2. Main Results. The scalar version of Azuma's inequality states that a scalar martingale 
exhibits normal concentration about its mean value, and the scale for deviations is controlled by 
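the total maximum squared range of the difference sequence. Here is a matrix extension.

Theorem 7.1 (Matrix Azuma). Consider a finite adapted sequence $\{\mathbf{X}_k\}$ of self-adjoint matrices
in dimension $d$, and a fixed sequence $\{\mathbf{A}_k\}$ of self-adjoint matrices that satisfy

\[ \mathbb{E}_{k-1} \mathbf{X}_k = \mathbf{0} \quad \text{and} \quad \mathbf{X}_k^2 \preccurlyeq \mathbf{A}_k^2 \quad \text{almost surely.} \]

Compute the variance parameter

\[ \sigma^2 := \left\| \sum\nolimits_k \mathbf{A}_k^2 \right\|. \tag{7.1} \]

Then, for all $t \ge 0$,

\[ \mathbb{P}\left\{ \lambda_{\max}\left( \sum\nolimits_k \mathbf{X}_k \right) \ge t \right\} \le d \cdot e^{-t^2/8\sigma^2}. \tag{7.2} \]
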
Theorem 7.1 can also be phrased directly in terms of a matrix martingale.

Corollary 7.2. Consider an s.a. matrix martingale $\{\mathbf{Y}_k : k = 1, \dots, n\}$ in dimension $d$, and
let $\{\mathbf{X}_k\}$ be the associated difference sequence. Suppose that the difference sequence satisfies the
hypotheses of Theorem 7.1, and compute the parameter $\sigma^2$ according to (7.1). Then

\[ \mathbb{P}\left\{ \lambda_{\max}(\mathbf{Y}_n - \mathbb{E}\, \mathbf{Y}_n) \ge t \right\} \le d \cdot e^{-t^2/8\sigma^2}. \tag{7.3} \]
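
In applications, one often inverts a tail bound of this type to find a deviation level that holds with a prescribed failure probability. The Python sketch below does this for the bound $d \cdot e^{-t^2/8\sigma^2}$: setting the bound equal to $\delta$ and solving for $t$ gives $t = \sigma \sqrt{8 \log(d/\delta)}$. The function names and the sample values of $d$, $\sigma^2$, and $\delta$ are hypothetical and serve only as an illustration.

```python
import math

def azuma_tail_bound(t, d, sigma2):
    """Tail bound d * exp(-t^2 / (8 sigma^2)) for dimension d and variance parameter sigma2."""
    return d * math.exp(-t * t / (8.0 * sigma2))

def azuma_deviation(delta, d, sigma2):
    """Level t at which the tail bound equals delta; requires 0 < delta < d."""
    return math.sqrt(8.0 * sigma2 * math.log(d / delta))

# Hypothetical values: dimension 50, variance parameter 2, failure probability 1%.
d, sigma2, delta = 50, 2.0, 0.01
t_delta = azuma_deviation(delta, d, sigma2)

# Evaluating the bound at t_delta should return delta up to rounding error.
print(t_delta, azuma_tail_bound(t_delta, d, sigma2))
```

The deviation level grows only like $\sqrt{\log d}$ in the dimension, which reflects the mild price of the dimensional factor in the bound.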

We continue with a few tangential comments. 

Remark 7.3 (Rectangular Version). The matrix Azuma inequality has a rectangular version, which 
we obtain by applying Theorem 7.1 to the s.a. dilation (2.10) of the adapted sequence. 

Remark 7.4 (Related Inequalities). There are several situations where the constant 1/8 in the 
bound (7.2) can be improved to 1/2. One case occurs when each summand is conditionally 
symmetric; see Remark 7.8. Another example requires the assumption that $\mathbf{X}_k$ commutes almost
surely with $\mathbf{A}_k$, which allows us to generalize the classical proof [McD98, Lem. 2.6] of the Azuma
inequality to the matrix setting. 

If we place the additional assumption that the summands are independent, Theorem 7.1 gives a 
matrix extension of one of Hoeffding's inequalities, which we have presented as Theorem 1.3 in the 
introduction. 

In the scalar setting, one of the most useful corollaries of Azuma's inequality is the bounded 
differences inequality of McDiarmid [McD98, Thm. 3.1]. This result states that a function of 
independent random variables exhibits normal concentration about its mean, and the variance 
depends on how much a change in a single variable can alter the value of the function. A version 
of the bounded differences inequality holds in the matrix setting. 

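Corollary 7.5 (Matrix Bounded Differences). Let $\{Z_k : k = 1, 2, \dots, n\}$ be an independent family
of random variables, and let $\mathbf{H}$ be a function that maps $n$ variables to a self-adjoint matrix of
dimension $d$. Consider a sequence $\{\mathbf{A}_k\}$ of fixed self-adjoint matrices that satisfy

\[ \left( \mathbf{H}(z_1, \dots, z_k, \dots, z_n) - \mathbf{H}(z_1, \dots, z_k', \dots, z_n) \right)^2 \preccurlyeq \mathbf{A}_k^2, \]

where $z_i$ and $z_i'$ range over all possible values of $Z_i$ for each index $i$. Compute the variance parameter

\[ \sigma^2 := \left\| \sum\nolimits_k \mathbf{A}_k^2 \right\|. \]

Then, for all $t \ge 0$,

\[ \mathbb{P}\left\{ \lambda_{\max}\left( \mathbf{H}(\mathbf{z}) - \mathbb{E}\, \mathbf{H}(\mathbf{z}) \right) \ge t \right\} \le d \cdot e^{-t^2/8\sigma^2}, \]

where $\mathbf{z} = (Z_1, \dots, Z_n)$.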

The proofs of the matrix Azuma and McDiarmid inequalities appear in the next two sections. 

7.3. Proof of Theorem 7.1. The classical approach to Azuma's inequality does not seem to 
extend directly to the matrix setting. See [McD98, Lem. 2.6] for a short presentation of this 
argument. We use a different type of proof that is inspired by methods from probability in Banach 
space [LT91]. The main idea is to inject additional randomness into the sum via a symmetrization 
procedure. 

Lemma 7.6 (Symmetrization). Let H be a fixed s.a. matrix, and let X be a random s.a. matrix 
with EX = 0. Then 

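\[ \mathbb{E} \operatorname{tr} e^{\mathbf{H} + \mathbf{X}} \le \mathbb{E} \operatorname{tr} e^{\mathbf{H} + 2\varepsilon \mathbf{X}}, \]

where $\varepsilon$ is a Rademacher variable independent from $\mathbf{X}$.

Proof. Construct an independent copy $\mathbf{X}'$ of the random matrix, and let $\mathbb{E}'$ denote integration with
respect to the new variable. Since the matrix is zero mean,

\[ \mathbb{E} \operatorname{tr} e^{\mathbf{H} + \mathbf{X}} = \mathbb{E} \operatorname{tr} e^{\mathbf{H} + \mathbf{X} - \mathbb{E}' \mathbf{X}'} \le \mathbb{E} \operatorname{tr} e^{\mathbf{H} + (\mathbf{X} - \mathbf{X}')} = \mathbb{E} \operatorname{tr} e^{\mathbf{H} + \varepsilon(\mathbf{X} - \mathbf{X}')}. \]

We have used the convexity of the trace exponential to justify Jensen's inequality. Since $\mathbf{X} - \mathbf{X}'$
is a symmetric random variable, we can modulate it by an independent Rademacher variable $\varepsilon$
without changing its distribution. The final bound depends on a short sequence of inequalities:

\begin{align*}
\mathbb{E} \operatorname{tr} e^{\mathbf{H} + \mathbf{X}}
  &\le \mathbb{E} \operatorname{tr}\left( e^{\mathbf{H}/2 + \varepsilon \mathbf{X}} \cdot e^{\mathbf{H}/2 - \varepsilon \mathbf{X}'} \right)
   \le \mathbb{E}\left[ \left( \operatorname{tr} e^{\mathbf{H} + 2\varepsilon \mathbf{X}} \right)^{1/2} \cdot \left( \operatorname{tr} e^{\mathbf{H} - 2\varepsilon \mathbf{X}'} \right)^{1/2} \right] \\
  &\le \left( \mathbb{E} \operatorname{tr} e^{\mathbf{H} + 2\varepsilon \mathbf{X}} \right)^{1/2} \cdot \left( \mathbb{E} \operatorname{tr} e^{\mathbf{H} - 2\varepsilon \mathbf{X}'} \right)^{1/2}
   = \mathbb{E} \operatorname{tr} e^{\mathbf{H} + 2\varepsilon \mathbf{X}}.
\end{align*}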

The first relation is the Golden-Thompson inequality (2.6); the second is the Cauchy-Schwarz 
inequality for the trace; and the third is the Cauchy-Schwarz inequality for real random variables. 
The last identity follows because the two factors are identically distributed. □ 

The other essential ingredient in the proof is a conditional bound for the matrix cgf of a
symmetrized random matrix.

Lemma 7.7 (Azuma cgf). Suppose that X is a random s.a. matrix and A is a fixed s.a. matrix 
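that satisfy $\mathbf{X}^2 \preccurlyeq \mathbf{A}^2$. Let $\varepsilon$ be a Rademacher random variable independent from $\mathbf{X}$. Then

\[ \log \mathbb{E}\left[ e^{2\varepsilon\theta \mathbf{X}} \mid \mathbf{X} \right] \preccurlyeq 2\theta^2 \mathbf{A}^2 \quad \text{for } \theta \in \mathbb{R}. \]

Proof. We apply the Rademacher mgf bound, Lemma 4.3, conditionally to obtain

\[ \mathbb{E}\left[ e^{2\varepsilon\theta \mathbf{X}} \mid \mathbf{X} \right] \preccurlyeq e^{2\theta^2 \mathbf{X}^2}. \]

The fact (2.8) that the logarithm is operator monotone implies that

\[ \log \mathbb{E}\left[ e^{2\varepsilon\theta \mathbf{X}} \mid \mathbf{X} \right] \preccurlyeq 2\theta^2 \mathbf{X}^2 \preccurlyeq 2\theta^2 \mathbf{A}^2, \]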

where the second relation follows from the hypothesis on X . □ 

We are prepared to establish the matrix Azuma inequality. The proof involves an iteration 
similar to the argument that implies the subadditivity of cgfs, Lemma 3.4, for sums of independent 
random matrices. 

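Proof of Theorem 7.1. The matrix Laplace transform method, Proposition 3.1, states that

\[ \mathbb{P}\left\{ \lambda_{\max}\left( \sum\nolimits_k \mathbf{X}_k \right) \ge t \right\} \le \inf_{\theta > 0} \left\{ e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} \exp\left( \sum\nolimits_k \theta \mathbf{X}_k \right) \right\}. \tag{7.4} \]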

The main difficulty in the proof is to bound the matrix mgf, which we accomplish by an iterative 
argument that alternates between symmetrization and cumulant bounds. 

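Let us detail the first step of the iteration. Define the natural filtration $\mathcal{F}_k := \mathfrak{F}(\mathbf{X}_1, \dots, \mathbf{X}_k)$
of the process $\{\mathbf{X}_k\}$, where $\mathfrak{F}(\cdot)$ denotes the sigma algebra generated by its arguments. Then we may compute

\begin{align*}
\mathbb{E} \operatorname{tr} \exp\left( \sum\nolimits_{k=1}^{n} \theta \mathbf{X}_k \right)
  &= \mathbb{E}\, \mathbb{E}\left[ \operatorname{tr} \exp\left( \sum\nolimits_{k=1}^{n-1} \theta \mathbf{X}_k + \theta \mathbf{X}_n \right) \,\Big|\, \mathcal{F}_{n-1} \right] \\
  &\le \mathbb{E}\, \mathbb{E}\left[ \operatorname{tr} \exp\left( \sum\nolimits_{k=1}^{n-1} \theta \mathbf{X}_k + 2\varepsilon\theta \mathbf{X}_n \right) \,\Big|\, \mathcal{F}_{n} \right] \\
  &\le \mathbb{E} \operatorname{tr} \exp\left( \sum\nolimits_{k=1}^{n-1} \theta \mathbf{X}_k + \log \mathbb{E}\left[ e^{2\varepsilon\theta \mathbf{X}_n} \mid \mathcal{F}_{n} \right] \right) \\
  &\le \mathbb{E} \operatorname{tr} \exp\left( \sum\nolimits_{k=1}^{n-1} \theta \mathbf{X}_k + 2\theta^2 \mathbf{A}_n^2 \right).
\end{align*}

The first identity is the tower property of conditional expectation. In the second line, we invoke
the symmetrization method, Lemma 7.6, conditional on $\mathcal{F}_{n-1}$, and then we relax the conditioning
on the inner expectation to the larger algebra $\mathcal{F}_n$. By construction, the Rademacher variable $\varepsilon$
is independent from $\mathcal{F}_n$, so we can apply the concavity result, Corollary 3.3, conditional on $\mathcal{F}_n$.
Finally, we use the fact (2.5) that the trace exponential is monotone to introduce the Azuma cgf
bound, Lemma 7.7, in the last inequality.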
By iteration, we achieve 



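\[ \mathbb{E} \operatorname{tr} \exp\left( \sum\nolimits_k \theta \mathbf{X}_k \right) \le \operatorname{tr} \exp\left( 2\theta^2 \sum\nolimits_k \mathbf{A}_k^2 \right). \tag{7.5} \]

Note that this procedure relies on the fact that the sequence $\{\mathbf{A}_k\}$ of upper bounds does not
depend on the values of the random sequence $\{\mathbf{X}_k\}$. Substitute the mgf bound (7.5) into the
Laplace transform bound (7.4), and observe that the infimum is achieved when $\theta = t/4\sigma^2$. □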
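
The final optimization can be sketched as follows. The trace of a positive-semidefinite matrix of dimension $d$ is at most $d$ times its maximum eigenvalue, so the right-hand side of (7.5) satisfies

\[ \operatorname{tr} \exp\left( 2\theta^2 \sum\nolimits_k \mathbf{A}_k^2 \right) \le d \cdot \lambda_{\max}\left( \exp\left( 2\theta^2 \sum\nolimits_k \mathbf{A}_k^2 \right) \right) = d \cdot e^{2\theta^2 \sigma^2}. \]

Inserting this estimate into (7.4) leaves the scalar problem of minimizing $-\theta t + 2\theta^2 \sigma^2$ over $\theta > 0$. The derivative $-t + 4\theta\sigma^2$ vanishes at $\theta = t/4\sigma^2$, where

\[ -\theta t + 2\theta^2 \sigma^2 = -\frac{t^2}{4\sigma^2} + \frac{t^2}{8\sigma^2} = -\frac{t^2}{8\sigma^2}, \]

which is the exponent in the matrix Azuma bound.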

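Remark 7.8. Suppose that the sequence $\{\mathbf{X}_k\}$ is conditionally symmetric:

\[ \mathbf{X}_k \sim -\mathbf{X}_k \quad \text{conditional on } \mathcal{F}_{k-1}. \]

When we execute the proof of Theorem 7.1 under this assumption, we can symmetrize each term
in the sum without suffering an extra factor of two. For example,

\[ \mathbb{E}\left[ \operatorname{tr} \exp\left( \sum\nolimits_{k=1}^{n-1} \theta \mathbf{X}_k + \theta \mathbf{X}_n \right) \,\Big|\, \mathcal{F}_{n-1} \right] = \mathbb{E}\left[ \operatorname{tr} \exp\left( \sum\nolimits_{k=1}^{n-1} \theta \mathbf{X}_k + \varepsilon\theta \mathbf{X}_n \right) \,\Big|\, \mathcal{F}_{n-1} \right], \]

where $\varepsilon$ is independent from $\mathcal{F}_n$. The rest of the proof remains the same, but the analog of the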
bound (7.2) has a constant of 1/2 instead of 1/8 in the exponent. 

7.4. Proof of Corollary 7.5. Finally, we establish the matrix version of the bounded differences 
inequality. The main idea in the argument is to construct the Doob martingale associated with the 
natural filtration of the independent random sequence. We compute semidefinite bounds for the 
difference sequence, and then we apply the matrix Azuma inequality to control the deviations of 
the martingale. 

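Proof of Corollary 7.5. In this argument only, we write $\mathbb{E}_Z$ for the expectation with respect to a
random variable $Z$, holding other variables fixed. Recall that $\mathbf{z} = (Z_1, \dots, Z_n)$. For $k = 0, 1, \dots, n$,
consider the random matrices

\[ \mathbf{Y}_k := \mathbb{E}\left[ \mathbf{H}(\mathbf{z}) \mid Z_1, Z_2, \dots, Z_k \right] = \mathbb{E}_{Z_{k+1}} \mathbb{E}_{Z_{k+2}} \cdots \mathbb{E}_{Z_n} \mathbf{H}(\mathbf{z}). \]

The sequence $\{\mathbf{Y}_k\}$ forms a Doob martingale. The associated difference sequence is

\[ \mathbf{X}_k := \mathbf{Y}_k - \mathbf{Y}_{k-1} = \mathbb{E}_{Z_{k+1}} \mathbb{E}_{Z_{k+2}} \cdots \mathbb{E}_{Z_n} \left( \mathbf{H}(\mathbf{z}) - \mathbb{E}_{Z_k} \mathbf{H}(\mathbf{z}) \right), \]

where the second identity follows from independence and Fubini's theorem.

It remains to bound the difference sequence. Let $Z_k'$ be an independent copy of $Z_k$, and construct
the random vector $\mathbf{z}' = (Z_1, \dots, Z_k', Z_{k+1}, \dots, Z_n)$. Observe that $\mathbb{E}_{Z_k} \mathbf{H}(\mathbf{z}) = \mathbb{E}_{Z_k'} \mathbf{H}(\mathbf{z}')$ and
that $\mathbf{H}(\mathbf{z})$ does not depend on $Z_k'$. Therefore, we can write

\[ \mathbf{X}_k = \mathbb{E}_{Z_{k+1}} \mathbb{E}_{Z_{k+2}} \cdots \mathbb{E}_{Z_n} \mathbb{E}_{Z_k'} \left( \mathbf{H}(\mathbf{z}) - \mathbf{H}(\mathbf{z}') \right). \]

The vectors $\mathbf{z}$ and $\mathbf{z}'$ differ only in the $k$th coordinate, so that

\[ \left( \mathbf{H}(\mathbf{z}) - \mathbf{H}(\mathbf{z}') \right)^2 \preccurlyeq \mathbf{A}_k^2 \]

by definition of the bound $\mathbf{A}_k^2$. Finally, the semidefinite Jensen inequality (2.14) for the matrix
square yields

\[ \mathbf{X}_k^2 \preccurlyeq \mathbb{E}_{Z_{k+1}} \mathbb{E}_{Z_{k+2}} \cdots \mathbb{E}_{Z_n} \mathbb{E}_{Z_k'} \left( \mathbf{H}(\mathbf{z}) - \mathbf{H}(\mathbf{z}') \right)^2 \preccurlyeq \mathbf{A}_k^2. \]

To complete the proof, we apply (7.3) to the martingale $\{\mathbf{Y}_k\}$. □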